
[GPU Fault Analysis: Xid 137 + 145] RLW_RXPIPE Interrupt & Nonfatal Retry: NVLink Rx Pipeline Event

ygtoken 2025. 10. 12. 19:02

 

🧠 Log Overview

2025-10-12T18:37:46.098195+09:00 gsvp-msi-gpu052 kernel:
NVRM: Xid (PCI:0000:db:00): 137, pid=202344, name=pt_nccl_watchdg, RLW_RXPIPE interrupt hit on link 0 on GPU0: PRIV Error
2025-10-12T18:37:46.098193+09:00 gsvp-msi-gpu052 kernel:
NVRM: Xid (PCI:0000:db:00): 145, pid=202344, name=pt_nccl_watchdg, RLW_RXPIPE Nonfatal XC0 i0 Link 00 (0x04080006 0x00000008 0x00000000 0x00000000 0x00000000 0x00000000)

 

  • Common information
    • Bus-ID 0000:db:00.0 → GPU index 5 (B200)
    • pid/name = 202344/pt_nccl_watchdg → the PyTorch NCCL watchdog
    • Both Xids were logged within 2 µs of each other → treated as a single NVLink Rx pipeline event
  • Xid 137: a privilege (PRIV) error raised in the NVLink Rx pipe
  • Xid 145: a non-fatal signal reported on the same link during the retry
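The fields called out above (bus ID, Xid code, pid, process name) can be pulled out of the kernel log mechanically. The following is a minimal sketch, not part of the original analysis: `parse_xid` is a hypothetical helper name, and it assumes the `NVRM: Xid (PCI:...): <code>, pid=<n>, name=<comm>, ...` layout seen in the two log lines above. Xid messages with a different layout are silently skipped.

```shell
#!/usr/bin/env bash
# parse_xid (hypothetical helper): read kernel log lines on stdin and print
#   <pci_bus_id> <xid_code> <pid> <process_name>
# for every line that matches the "NVRM: Xid (PCI:...): N, pid=..., name=...," layout.
# Lines that do not match this layout produce no output.
parse_xid() {
  sed -nE 's/.*NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): ([0-9]+), pid=([0-9]+), name=([^,]+),.*/\1 \2 \3 \4/p'
}
```

Typical use would be `journalctl -k | parse_xid`, which turns each matching line into four space-separated fields that are easy to sort or count.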

 


 

📍 1. Symptoms

 

  • Xid 137 and Xid 145 were logged back to back on GPU 5 (NVLink Link 0)
  • The NCCL workload kept running without interruption: no performance degradation and no job failure
  • No impact at the system or VM level

 


 

🔍 2. Investigation

 

 

① Identify the affected GPU

nvidia-smi
| GPU 5 | Bus-Id 00000000:DB:00.0 | PID 202344 | /usr/bin/python |

→ The Bus-Id and PID match the Xid log → confirmed as GPU index 5
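The Xid line prints the bus ID as `0000:db:00`, while nvidia-smi prints `00000000:DB:00.0`, so matching by eye is error-prone on an 8-GPU node. As a rough sketch (the function name `gpu_index_for_bus` is made up for illustration), the mapping can be done on the output of `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader` with a case-insensitive substring match:

```shell
#!/usr/bin/env bash
# gpu_index_for_bus (hypothetical helper)
#   stdin : output of `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`
#           (one "index, bus_id" pair per line, e.g. "5, 00000000:DB:00.0")
#   $1    : bus ID exactly as printed in the Xid line, e.g. 0000:db:00
# Prints the index of every GPU whose bus ID contains $1, ignoring case.
gpu_index_for_bus() {
  awk -F', ' -v want="$1" 'index(toupper($2), toupper(want)) > 0 { print $1 }'
}
```

For the incident above this would be run as `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader | gpu_index_for_bus 0000:db:00`.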

 

 

② Check NVLink status

nvidia-smi nvlink -s -i 5
GPU 5: NVIDIA B200
    Link 0-17: 50 GB/s (all Active)

→ Every link reports its normal bandwidth, so this is not a hardware link-down

 

 

③ Correlate with surrounding kernel logs

journalctl -k --since "2025-10-12 18:35:00" --until "2025-10-12 18:40:00" | grep -Ei "Xid|NVRM|nvlink"

 

  • Xid 137 and 145 appear once, as a single one-off pair
  • No other Xids (79, 109, etc.) accompany them
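Checking for accompanying Xids is easier with a per-code count than by reading raw grep output. A minimal sketch, assuming the same `Xid (PCI:<bus>): <code>,` layout as the log lines above (`xid_summary` is an illustrative name, not an existing tool):

```shell
#!/usr/bin/env bash
# xid_summary (hypothetical helper)
#   stdin : kernel log lines, e.g. from `journalctl -k --since ... --until ...`
#   $1    : PCI bus ID as printed in the Xid line, e.g. 0000:db:00
# Prints one "<xid_code> <count>" line per distinct Xid code seen on that device,
# sorted numerically by code. Prints nothing if the device has no Xid lines.
xid_summary() {
  grep -F "Xid (PCI:$1)" \
    | sed -nE 's/.*Xid \(PCI:[^)]*\): ([0-9]+),.*/\1/p' \
    | sort -n \
    | uniq -c \
    | awk '{ print $2 " " $1 }'
}
```

Piping the five-minute journalctl window from this step into `xid_summary 0000:db:00` would list only codes 137 and 145 if nothing else was logged for that device.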

 

 

④ Check at the NCCL level (optional)

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH

 

  • No NCCL WARN or timeout messages at the same timestamp → soft recovery completed

 


 

⚙️ 3. Root Cause

 

  • NVLink Rx Pipeline Privilege Error (PRIV Error)
    • During NCCL communication, the NVLink hardware detected a register transaction exception
  • Non-fatal recovery inside the driver (Non-fatal Retry)
    • A retry was triggered at the hardware level → Xid 145 logged
  • No GPU reset or link down
    • Recovery succeeded at the software level and the job continued

 


 

 

🧩 4. Actions

 

✅ Observed Outcome

  • One-off event confined to a single moment in time
  • GPU 5 and the workload continue to run normally
  • nvidia-smi nvlink -s -i 5 shows every link Active (50 GB/s)
  • No additional Xid events (79, 109, etc.)

 

 

🧭 Operational Guidance

 

1️⃣ Monitor for recurrence

sudo journalctl -k | grep "[Xx]id (PCI:0000:db:00)"

 

  • Keep tracking if the event repeats on the same GPU or on Link 0
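For unattended monitoring, the grep above can be turned into a yes/no check that a cron job or alert hook could call. This is only a sketch: `xid_count_at_least` is a hypothetical name, and the threshold is an operational choice, not something prescribed by this incident.

```shell
#!/usr/bin/env bash
# xid_count_at_least (hypothetical helper)
#   stdin : kernel log lines
#   $1    : PCI bus ID as printed in the Xid line, e.g. 0000:db:00
#   $2    : threshold (integer)
# Exit status 0 if at least $2 Xid lines were logged for that device, 1 otherwise.
xid_count_at_least() {
  local bus="$1" threshold="$2" count
  # grep -c prints 0 and exits non-zero when nothing matches; keep the count anyway.
  count=$(grep -cF "Xid (PCI:$bus)" || true)
  [ "$count" -ge "$threshold" ]
}
```

For example, `journalctl -k --since "24 hours ago" | xid_count_at_least 0000:db:00 4 && echo "Xid recurring on 0000:db:00"` would flag the device once four or more Xid lines (two 137/145 pairs) show up within a day.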

 

2️⃣ Check NVLink status (driver 580+ environments)

nvidia-smi nvlink -i 5 --status

 

  • Check each link's bandwidth (50 GB/s) and Active state directly
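With 18 links per GPU, a scripted count is less error-prone than scanning the output by eye. The sketch below assumes the per-link layout that nvidia-smi commonly prints, one `Link N: <speed> GB/s` line per active link and `Link N: <inactive>` otherwise; that layout is an assumption and may differ between driver versions, and `count_inactive_links` is an illustrative name.

```shell
#!/usr/bin/env bash
# count_inactive_links (hypothetical helper)
#   stdin : output of `nvidia-smi nvlink --status -i <gpu>`
# Assumes one "Link N: ..." line per link, where active links report a speed in GB/s.
# Prints "<total_links> <links_without_a_GB/s_speed>".
count_inactive_links() {
  awk '
    /^[[:space:]]*Link [0-9]+:/ {
      total++
      if ($0 !~ /GB\/s/) bad++
    }
    END { printf "%d %d\n", total, bad }
  '
}
```

Running `nvidia-smi nvlink --status -i 5 | count_inactive_links` on a healthy GPU with this layout would report a second field of 0.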

 

3️⃣ DCGM-based health check

sudo dcgmi diag -r 3
# Enable all health watches, then run the check
sudo dcgmi health -s a
sudo dcgmi health -c

 

  • Runs a combined diagnostic covering NVLink integrity, ECC, PCIe status, and more

 

4️⃣ If the event keeps recurring

 

  • Verify that the driver and NVSwitch firmware versions are aligned
  • Cold-reboot the node to reinitialize the hardware
  • If the event repeats on a specific port, inspect or replace the hardware

 


 

📈 5. Verification

sudo journalctl -k | grep "[Xx]id (PCI:0000:db:00)"
# No new output since the incident → Xid 137/145 has not recurred on this Bus-ID

nvidia-smi nvlink -i 5 --status
# All links remain Active at 50 GB/s
✅ NVLink links Active
✅ No GPU reset or NCCL timeout
✅ Workload continues normally with no performance degradation

 

 
