
[GPU Failure Analysis | NVSW] knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!

ygtoken 2025. 10. 11. 17:22

 

🧠 Log Overview

 

This log is a kernel-level NVRM warning emitted when the NVIDIA NVLink subsystem fails to update the link state between a GPU and the NVSwitch.

NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!

The message appears mainly when an NVLink port fails detection in the Rx Detect phase, or when NVLink re-initialization has not completed after a PCIe bus reset.

 

If GPU initialization does not complete, KubeVirt's virt-handler hits a libvirt initialization failure (LibvirtError) while attaching the GPU to the VM, and the VM never actually starts.

 


 

📍 1. Symptoms

 

  • Environment
    • The physical node does not use the GPUs directly; it runs as a Kubernetes worker node
    • GPUs are used inside VMs via passthrough
    • VMs are created with KubeVirt, and Slurm jobs run inside the VMs
  • Observed logs
    • Host kernel log:
2025-10-10T22:25:01.838131+09:00 gsvp-msi-gpu031 kernel: NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!

 

    • virt-handler log:
LibvirtError: internal error: Unknown PCI header type '127' for device '0000:ed:00.0'

 

  • Symptom summary
    • A VM creation request was submitted, but no VMI was created
    • libvirt initialization failed in the virt-handler pod at GPU attach time
    • The VM status stayed at “Stopped” or “Waiting”
  • Impact
    • VMs assigned this GPU cannot start
    • GPU scheduling on the same node may be delayed

 


 

🔍 2. Investigation

 

 

① Check the virt-handler logs (per node)

kubectl get pods -n kubevirt -l kubevirt.io=virt-handler -o wide   # -o wide shows which node each pod runs on
kubectl logs -n kubevirt virt-handler-gsvp-msi-gpu031

 

  • virt-handler runs as a DaemonSet, one pod per node, and handles the VMI lifecycle and the libvirt integration
  • The handler log on this node confirms the GPU initialization failure:
LibvirtError: internal error: Unknown PCI header type '127' for device '0000:ed:00.0'

 

  • The PCIe header could not be read during GPU device initialization
  • The SyncFailed event was raised inside the handler, and no VMI was created
  • The VM object exists, but it fails at the libvirt attach stage
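A likely reading of the value '127', offered as an assumption rather than something the original logs state: the PCI header type is the low 7 bits of the config-space byte at offset 0x0E, and a device that has stopped answering config reads returns 0xFF for every byte, which masks to 0x7F = 127. The sketch below reads that byte through sysfs so the value can be checked directly on the host; `pci_header_type` is a hypothetical helper name.

```shell
# Hypothetical helper: print the PCI header type of a device, given the path
# to its config-space file (on Linux: /sys/bus/pci/devices/<BDF>/config).
pci_header_type() {
    local cfg="$1" raw
    # The byte at offset 0x0E (decimal 14) is the Header Type register.
    raw=$(od -An -tu1 -j14 -N1 "$cfg" | tr -d ' \n')
    # Bit 7 is the multi-function flag; bits 0-6 encode the header layout.
    echo $(( raw & 0x7f ))
}

# Usage on the affected host (needs read access to sysfs):
#   pci_header_type /sys/bus/pci/devices/0000:ed:00.0/config
```

Defined layouts are 0 (endpoint), 1 (PCI-to-PCI bridge) and 2 (CardBus bridge), so a result of 127 suggests the config read itself is returning all ones.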

 

 

② Check the VirtualMachine (VM) resource status

kubectl get vm -n <namespace>
kubectl describe vm <VM_NAME> -n <namespace>

 

  • The declared VM spec is fine, but the Status is Stopped or Waiting
  • The events include failure messages related to VMI creation

 

 

③ Check the host kernel log

journalctl -k -n 300 | grep -i nvrm

 

  • NVLink-related errors occur repeatedly:
NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!

 

  • → Because link detection between the GPU and the NVSwitch failed, the device is left in a “partially initialized” state
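To judge whether the error is a one-off or keeps repeating, the occurrences can be counted instead of eyeballed. A minimal sketch, assuming the message text matches the log line quoted above; `count_rxdetect_failures` is a hypothetical helper name.

```shell
# Hypothetical helper: count Rx Detect link-mask failures in kernel log text
# supplied on stdin.
count_rxdetect_failures() {
    # grep -c prints the number of matching lines (0 when none match);
    # "|| true" keeps a zero count from looking like a failure to the caller.
    grep -c 'knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask' || true
}

# Usage: count occurrences since the current boot.
#   journalctl -k -b --no-pager | count_rxdetect_failures
```

Running the same count after recovery gives a simple recurrence check: it should print 0 for the new boot.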

 

 

④ Check GPU and PCIe status

nvidia-smi -L
nvidia-smi nvlink --status   # per-link state for every GPU
lspci -vvv | grep -i nvidia -A 10

 

  • The GPU device is present on PCIe, but NVLink is Inactive
  • NVLink bring-up failed → the libvirt error occurs at GPU attach time

 


 

⚙️ 3. Root Cause

 

  • NVLink detection (Rx Detect) phase failure
    • The GPU ↔ NVSwitch link was not re-established, so the NVLink state update failed
  • libvirt attach timing mismatch
    • The VM attach was attempted before GPU initialization completed, causing the PCI header read error
  • Incomplete PCIe bus hot reset
    • The GPU is logically recognized but stays in a state where NVLink bring-up has not completed
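If the attach-timing explanation holds, one possible mitigation (not part of the resolution used in this incident) is to hold off VM start until the device answers config reads again. The sketch below polls the vendor ID at config-space offset 0, which reads as 0xFFFF when the device is not responding; `wait_for_pci_ready` and its timeout and interval arguments are hypothetical.

```shell
# Hypothetical gate: wait until a PCI device answers config reads.
#   $1 = path to the config-space file, $2 = timeout (s), $3 = poll interval (s)
wait_for_pci_ready() {
    local cfg="$1" timeout="${2:-60}" interval="${3:-2}" waited=0 vendor
    while :; do
        # The first two config bytes hold the vendor ID; "ffff" means no response.
        vendor=$(od -An -tx1 -N2 "$cfg" 2>/dev/null | tr -d ' \n')
        if [ -n "$vendor" ] && [ "$vendor" != "ffff" ]; then
            return 0
        fi
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$(( waited + interval ))
    done
}

# Usage:
#   wait_for_pci_ready /sys/bus/pci/devices/0000:ed:00.0/config 120 5 \
#       || echo "GPU still not responding; do not start the VM"
```

This only checks PCIe-level responsiveness, not NVLink bring-up, and in a case like this one, where a cold reboot was needed, it would simply time out and report the device as not ready.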

 


 

🧩 4. Actions and Response

 

 

✅ Confirmed Resolution

 

  • Cold Reboot
    • Fully power off the physical node, then boot it again
    • PCIe and NVLink are fully re-initialized at the hardware level
    • The VM (VMI) was not deleted; the GPU recovered automatically after the host came back up
sudo shutdown -h now
# After power OFF → ON, check the status
nvidia-smi nvlink --status

 

  • Confirmed that all NVLink links are Active and the VM starts normally

 


 

🧭 Alternative Actions for Reference (Alternative / Workaround)

 

These were not performed in this case, but they are general actions worth trying when similar symptoms appear.

 

 

  1. Reload the GPU driver
sudo systemctl stop nvidia-persistenced
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe -a nvidia nvidia_modeset nvidia_drm nvidia_uvm
sudo systemctl start nvidia-persistenced

 

  • This can restore NVLink in some environments, but it is not a fundamental fix.
  2. Check version alignment
    • Verify version compatibility among the NVSwitch FW, GPU FW, and driver
    • If the issue appeared after a kernel update, consider rolling back the driver
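For the version check, it helps to capture the relevant versions in one place before comparing them against a compatibility matrix. A minimal sketch; `collect_versions` is a hypothetical helper, and it deliberately omits the NVSwitch firmware version because the way to query it depends on the platform tooling.

```shell
# Hypothetical helper: print kernel, NVIDIA driver, and GPU VBIOS versions.
collect_versions() {
    echo "kernel: $(uname -r)"
    # The loaded kernel module reports its version here when the driver is loaded.
    if [ -r /proc/driver/nvidia/version ]; then
        cat /proc/driver/nvidia/version
    else
        echo "nvidia driver: not loaded"
    fi
    # Per-GPU driver and VBIOS versions, if nvidia-smi is available.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index,name,driver_version,vbios_version --format=csv \
            || echo "nvidia-smi query failed"
    fi
}

# Usage: save a snapshot to diff before and after a kernel or driver update.
#   collect_versions > "versions-$(hostname)-$(date +%F).txt"
```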

 


 

📈 5. Verification

 

  • Host
nvidia-smi
nvidia-smi nvlink --status

 

  • → NVLink confirmed Active; no recurrence in the kernel log
  • VM
    • VM startup completed (kubectl get vm → Running)
    • GPU passthrough works normally (nvidia-smi responds normally)

 
