HPC & GPU Engineering/Platform Essentials

๊ณ ์„ฑ๋Šฅ AI ์ปดํ“จํŒ… ์ธํ”„๋ผ ๊ธฐ์ˆ  ํ‚ค์›Œ๋“œ ์ •๋ฆฌ (v1.0)

ygtoken 2025. 8. 3. 14:54
728x90

 

๐Ÿ”ท 1. HPC ์ธํ”„๋ผ ๋ฐ ํ•˜๋“œ์›จ์–ด ๊ตฌ์กฐ

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
๊ฐ€์†๊ธฐ ํ•˜๋“œ์›จ์–ด Accelerator GPU, NPU, Gaudi, MI250 ๋“ฑ ๊ณ ์† ์—ฐ์‚ฐ์„ ์œ„ํ•œ ํŠน์ˆ˜ ์—ฐ์‚ฐ ์žฅ์น˜
๋ฉ”๋ชจ๋ฆฌ HBM (High Bandwidth Memory) GPU ๋‚ด๋ถ€ ๊ณ ๋Œ€์—ญํญ ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ ์ฒ˜๋ฆฌ์— ํ•„์ˆ˜
์นด๋“œ/์„œ๋ฒ„ ๊ตฌ์กฐ PCIe ์นด๋“œ, PCIe Passthrough GPU๋ฅผ PCIe ์Šฌ๋กฏ์— ์žฅ์ฐฉํ•˜๊ฑฐ๋‚˜ VM์— ์ง์ ‘ ํ• ๋‹นํ•˜๋Š” ๊ธฐ์ˆ 
์‹คํ–‰ ํ™˜๊ฒฝ Baremetal, Service VM ๊ฐ€์ƒํ™” ์—†์ด ๋ฌผ๋ฆฌ ์„œ๋ฒ„๋ฅผ ์ง์ ‘ ์šด์˜ํ•˜๊ฑฐ๋‚˜ ์ œ์–ด ์ „์šฉ VM ๊ตฌ์„ฑ
๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์กฐ NUMA CPU-GPU ๊ฐ„ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹œ๊ฐ„์˜ ์ฐจ์ด๊ฐ€ ์กด์žฌํ•˜๋Š” ๊ตฌ์กฐ๋กœ, ์ตœ์  ๋ฐฐ์น˜ ๋ฐ ์„ฑ๋Šฅ ๋ถ„์„์— ์ค‘์š”
์ „๋ ฅ/๋ฐœ์—ด Power Budget, Power Density, Cooling Power Overhead, TDP ๋ฐ์ดํ„ฐ์„ผํ„ฐ ์ „๋ ฅ ์„ค๊ณ„ ์‹œ ๊ณ ๋ ค๋˜๋Š” ์†Œ๋น„๋Ÿ‰๊ณผ ๋ฐœ์—ด ๊ฐ’, ๋ƒ‰๊ฐ ๋น„์šฉ ๋“ฑ์„ ํฌํ•จํ•œ ์„ค๊ณ„ ์š”์†Œ
ํ†ตํ•ฉ ์Šคํƒ Vertical Integration, Cross-layer Optimization ํ•˜๋“œ์›จ์–ด๋ถ€ํ„ฐ ์†Œํ”„ํŠธ์›จ์–ด, ํ”„๋ ˆ์ž„์›Œํฌ, ์•Œ๊ณ ๋ฆฌ์ฆ˜๊นŒ์ง€ ์ˆ˜์ง์ ์œผ๋กœ ํ†ตํ•ฉ ์ตœ์ ํ™”๋œ ๊ตฌ์กฐ
GPU ๊ฐ€์ƒํ™” MIG (Multi-Instance GPU) NVIDIA A100/H100์—์„œ ์ง€์›๋˜๋Š” GPU ๋…ผ๋ฆฌ ๋ถ„ํ•  ๊ธฐ๋Šฅ์œผ๋กœ ๋‹ค์ค‘ ์›Œํฌ๋กœ๋“œ๋ฅผ ์ง€์›
๋ณ‘๋ ฌ ํŒŒ์ผ ์‹œ์Šคํ…œ Lustre, BeeGFS, Spectrum Scale HPC/AI ํ™˜๊ฒฝ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๊ณ ์„ฑ๋Šฅ ๋ณ‘๋ ฌ ํŒŒ์ผ ์‹œ์Šคํ…œ์œผ๋กœ, ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ์— ์ ํ•ฉ

๐Ÿ”ท 2. ์Šค์ผ€์ค„๋ง ๋ฐ ๋ถ„์‚ฐ ํ•™์Šต ์ „๋žต

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
ํ†ต์‹  ๊ตฌ์กฐ Collective Communication, Ring Topology GPU ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•œ ๊ตฌ์กฐ๋กœ ํ‰๊ท , sum, sync ๋“ฑ์„ ์œ„ํ•œ ๋ฉ”์‹œ์ง€ ๊ตํ™˜ ๋ฐฉ์‹
๋ถ„์‚ฐ ํ•™์Šต Distributed Training, Parameter Synchronization ์—ฌ๋Ÿฌ GPU์— ๋ชจ๋ธ์„ ๋ถ„์‚ฐ์‹œ์ผœ ๋ณ‘๋ ฌ๋กœ ํ•™์Šตํ•˜๊ณ , ์ฃผ๊ธฐ์ ์œผ๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋™๊ธฐํ™”ํ•˜๋Š” ๊ตฌ์กฐ
์ •๋ฐ€๋„ ์ตœ์ ํ™” Mixed Precision Training FP16, BF16 ๋“ฑ์„ ํ˜ผ์šฉํ•˜์—ฌ ํ•™์Šต ์†๋„ ๋ฐ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ตœ์ ํ™”ํ•˜๋Š” ๊ธฐ๋ฒ•
์Šค์ผ€์ค„๋ง ์ „๋žต Job Completion Time, Gang Scheduling, Elastic Scheduling, Preemption ์ „์ฒด ์ž‘์—…์˜ ์™„๋ฃŒ ์‹œ๊ฐ„ ์ธก์ •, ๋™์‹œ ์‹คํ–‰ ๋ณด์žฅ, ์„ ์  ์‹คํ–‰ ๋“ฑ ๋‹ค์–‘ํ•œ GPU ์Šค์ผ€์ค„๋ง ์ „๋žต
๋ถˆ๊ท ํ˜• ํƒ์ง€ Imbalance Detection GPU ๊ฐ„ ์ž‘์—…๋Ÿ‰/ํ†ต์‹ ๋Ÿ‰/๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์˜ ๋ถˆ๊ท ํ˜•์„ ๊ฐ์ง€ํ•˜์—ฌ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๋ฐฉ์ง€ํ•˜๋Š” ์ „๋žต

๐Ÿ”ท 3. ๋„คํŠธ์›Œํ‚น ๋ฐ GPU ํŒจ๋ธŒ๋ฆญ

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
GPU ์ธํ„ฐ์ปค๋„ฅํŠธ NVLink (MVLink), GPU Fabric ๊ณ ์† GPU ๊ฐ„ ํ†ต์‹  ๊ตฌ์กฐ๋กœ, ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ ๋ฐ ๋Œ€์šฉ๋Ÿ‰ ๋ชจ๋ธ ์—ฐ์‚ฐ์— ์ตœ์ ํ™”๋จ
๊ณ ์† ํ†ต์‹  ๊ธฐ์ˆ  RDMA, GPUDirect RDMA CPU ๊ฐœ์ž… ์—†์ด NIC๋ฅผ ํ†ตํ•ด GPU ๋ฉ”๋ชจ๋ฆฌ ๊ฐ„ ์ง์ ‘ ๋ฐ์ดํ„ฐ ์ „์†ก ๊ฐ€๋Šฅ
NIC/์ธํ„ฐํŽ˜์ด์Šค High-speed NIC, SR-IOV, VF GPU์šฉ ๊ณ ์† NIC(200/400Gbps), ๊ฐ€์ƒ NIC ์ƒ์„ฑ ๊ธฐ์ˆ  ํฌํ•จ
ํŠธ๋ž˜ํ”ฝ ์ฒ˜๋ฆฌ Offloading, Link Aggregation ๋„คํŠธ์›Œํฌ ๋ถ€ํ•˜๋ฅผ NIC์—์„œ ์ฒ˜๋ฆฌํ•˜๊ฑฐ๋‚˜ ์—ฌ๋Ÿฌ NIC๋ฅผ ๋ฌถ์–ด ๋Œ€์—ญํญ ์ฆ๊ฐ€
์Šค์œ„์น˜ ์ง€๋Šฅํ™” In-Network Computing SmartNIC/์Šค์œ„์น˜์—์„œ ์—ฐ์‚ฐ์„ ์ผ๋ถ€ ์ฒ˜๋ฆฌํ•˜์—ฌ ์ง€์—ฐ์„ ์ค„์ด๋Š” ๊ธฐ์ˆ 
DC ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ Leaf-Spine Topology, OVS/OVN ๋ณ‘๋ชฉ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๋„คํŠธ์›Œํฌ ์„ค๊ณ„ ๋ฐ ์˜คํ”ˆ์†Œ์Šค ๊ฐ€์ƒ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ
์ตœ์‹  ํ•˜๋“œ์›จ์–ด SmartNIC, UCIe, Liquid Cooling ์ฐจ์„ธ๋Œ€ ๊ณ ์„ฑ๋Šฅ ์ธํ”„๋ผ๋ฅผ ์œ„ํ•œ ์ธํ„ฐํŽ˜์ด์Šค ๋ฐ ๋ƒ‰๊ฐ ๊ธฐ์ˆ , ๋ชจ๋“ˆํ˜• ์นฉ ๊ตฌ์กฐ ๊ธฐ์ˆ 

๐Ÿ”ท 4. ๊ฐ€์ƒํ™” ๋ฐ ํด๋Ÿฌ์Šคํ„ฐ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
๊ฐ€์ƒํ™” ๊ธฐ์ˆ  GPU Virtualization, GPU Passthrough, Hypervisor-based VM, vGPU VM์— GPU๋ฅผ ์ง์ ‘ ์—ฐ๊ฒฐํ•˜๊ฑฐ๋‚˜ ๊ฐ€์ƒ GPU(vGPU)๋กœ ๋ถ„ํ• ํ•˜์—ฌ ๋‹ค์ค‘ ์‚ฌ์šฉ์ž ์ง€์›
๋ฆฌ์†Œ์Šค ๊ฒฉ๋ฆฌ GPU Resource Isolation, Tenant-level Resource Segregation ์‚ฌ์šฉ์ž ๊ฐ„ GPU ๊ฐ„์„ญ์„ ๋ฐฉ์ง€ํ•˜๋Š” ์ž์› ๋ถ„๋ฆฌ ์ „๋žต
์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜ Operator Pattern, Resource Orchestration K8s ๊ธฐ๋ฐ˜ ์ž์›์˜ ์ƒ์„ฑ, ํ™•์žฅ, ๋ณต๊ตฌ๋ฅผ ์ž๋™ํ™”ํ•˜๋Š” ๊ตฌ์กฐ
์ปค์Šคํ…€ ๋””๋ฐ”์ด์Šค ์—ฐ๋™ K8s Device Plugin K8s์—์„œ GPU, FPGA ๋“ฑ์˜ ํŠน์ˆ˜ ์ž์›์„ ์ธ์‹์‹œํ‚ค๋Š” ํ™•์žฅ ๋ชจ๋“ˆ
์ด๊ธฐ์ข… ์ž์› ํ†ตํ•ฉ GPU + Custom AI Accelerator Cluster GPU, NPU, DPU ๋“ฑ ๋‹ค์–‘ํ•œ ์—ฐ์‚ฐ ์ž์›์„ ํ•˜๋‚˜์˜ ํด๋Ÿฌ์Šคํ„ฐ๋กœ ํ†ตํ•ฉ ์šด์˜ํ•˜๋Š” ๊ตฌ์กฐ

๐Ÿ”ท 5. AI ์ธํ”„๋ผ ๋ฐ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ์ „๋ ฅ ์„ค๊ณ„

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
AIDC ๊ตฌ์กฐ AI Data Center (AIDC) AI ์›Œํฌ๋กœ๋“œ ์ „์šฉ์œผ๋กœ ์„ค๊ณ„๋œ ๊ณ ๋ฐ€๋„ ์—ฐ์‚ฐ ํŠนํ™” ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๊ตฌ์กฐ
์ „๋ ฅ ์„ค๊ณ„ 35kW ๋ž™, Worst-case Power Design, Power Saving Chain ํ”ผํฌ ๋ถ€ํ•˜๋ฅผ ๊ณ ๋ คํ•œ ์ „๋ ฅ ์„ค๊ณ„ ๋ฐ ๋ˆ„์  ์ ˆ์ „ ๊ตฌ์กฐ ์ ์šฉ
์ „์› ๋ถ€ํ’ˆ Inductor / PMIC ์ „์•• ์•ˆ์ •ํ™” ๋ฐ ์ „๋ ฅ ํšจ์œจํ™”์— ์‚ฌ์šฉ๋˜๋Š” ์•„๋‚ ๋กœ๊ทธ ์ „๋ ฅ ๋ถ€ํ’ˆ
์‹ค์‹œ๊ฐ„ ๋ชจ๋‹ˆํ„ฐ๋ง Device-level Signal Monitoring, GPU Telemetry GPU ์˜จ๋„, ํŒฌ์†, ์ „๋ ฅ ๋“ฑ์˜ ์ƒํƒœ ์ •๋ณด๋ฅผ ์‹ค์‹œ๊ฐ„ ์ˆ˜์ง‘ํ•˜๋Š” ๊ตฌ์กฐ

๐Ÿ”ท 6. ํด๋Ÿฌ์Šคํ„ฐ ๊ด€๋ฆฌ ๋ฐ ์šด์˜ ์ „๋žต

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
๊ณต๊ธ‰๋ง ์ „๋žต GPU Supply Chain Management, Secondary Sourcing ํŠน์ • ๋ฒค๋” ์ข…์†๋„๋ฅผ ๋‚ฎ์ถ”๊ณ , ์•ˆ์ •์  GPU ํ™•๋ณด๋ฅผ ์œ„ํ•œ ์ „๋žต
ํด๋Ÿฌ์Šคํ„ฐ ๊ด€๋ฆฌ Cluster Manager, Monitoring / Allocation / Deployment GPU, VM, ์ปจํ…Œ์ด๋„ˆ ๋“ฑ์˜ ์ƒํƒœ๋ฅผ ํ†ตํ•ฉ์ ์œผ๋กœ ๊ด€๋ฆฌํ•˜๋Š” ์‹œ์Šคํ…œ
ํ™œ์šฉ๋„ ์ตœ์ ํ™” Utilization Optimization, QoS, Idle GPU Detection ์œ ํœด ์ž์›์„ ์ž๋™ ํšŒ์ˆ˜ํ•˜๊ฑฐ๋‚˜ ์šฐ์„ ์ˆœ์œ„๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ• ๋‹น ์ตœ์ ํ™”
์Šค์ผ€์ค„๋ง ํ†ตํ•ฉ Inference Scheduling / VM Orchestration ์ถ”๋ก  ์š”์ฒญ์„ ๋‹ค์–‘ํ•œ VM์— ๋ถ„์‚ฐ์‹œํ‚ค๊ณ , ์ž์› ํ• ๋‹น์„ ์ž๋™ํ™”ํ•˜์—ฌ GPU ํ™œ์šฉ๋ฅ ์„ ๊ทน๋Œ€ํ™”ํ•˜๋Š” ์ „๋žต
๋…ธ๋“œ ์ •์ฑ… Node Affinity / Anti-Affinity ์›Œํฌ๋กœ๋“œ๋ฅผ ํŠน์ • ๋…ธ๋“œ์— ์ง‘์ค‘ ๋˜๋Š” ๋ถ„์‚ฐ์‹œ์ผœ ์ž์› ํ™œ์šฉ ๊ทน๋Œ€ํ™” ๋ฐ ์žฅ์•  ํšŒํ”ผ

๐Ÿ”ท 7. AI ํ”Œ๋žซํผ ๋ฐ ์›Œํฌํ”Œ๋กœ์šฐ ์ž๋™ํ™”

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
ํ”Œ๋žซํผ ๋ชจ๋ธ GPUaaS, Hybrid Cloud, IaaS/PaaS/SaaS GPU๋ฅผ ๊ตฌ๋…ํ˜• ํด๋ผ์šฐ๋“œ ์ž์›์œผ๋กœ ์ œ๊ณตํ•˜๋Š” ํ˜•ํƒœ
ํ•™์Šต ํ”„๋กœ์„ธ์Šค Model Training / Tuning / Inference, Precision Format AI ๋ชจ๋ธ ํ•™์Šต~์ถ”๋ก ์˜ ์ „์ฒด ํ๋ฆ„ ๋ฐ ์ •๋ฐ€๋„ ํ˜•์‹(FP32/BF16 ๋“ฑ)
ํŒŒ์ดํ”„๋ผ์ธ AIOps / MLOps, Airflow / Kubeflow ๋ชจ๋ธ ๊ฐœ๋ฐœ, ๋ฐฐํฌ, ๋ชจ๋‹ˆํ„ฐ๋ง์„ ์ž๋™ํ™”ํ•˜๊ณ , ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์‹œ๊ฐํ™”ํ•˜๋Š” ์‹œ์Šคํ…œ
๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ Data Labeling / Preprocessing, Vector Indexing AI ํ•™์Šต์— ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ์ฃผ์„/์ „์ฒ˜๋ฆฌ์™€ ์ž„๋ฒ ๋”ฉ ๊ฒ€์ƒ‰์„ ์œ„ํ•œ ์ธ๋ฑ์‹ฑ ๊ตฌ์กฐ
ํ•™์Šต ๋ณต๊ตฌ Checkpointing ์žฅ์‹œ๊ฐ„ ํ•™์Šต ์‹œ ์ค‘๋‹จ์— ๋Œ€๋น„ํ•ด ์ค‘๊ฐ„ ์ƒํƒœ๋ฅผ ์ €์žฅํ•˜๊ณ  ์žฌ์‹œ์ž‘์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•จ
๋ฒ„์ „ ๊ด€๋ฆฌ Model Versioning ๋ชจ๋ธ์„ ๋ฒ„์ „ ๋‹จ์œ„๋กœ ๊ด€๋ฆฌํ•˜๊ณ , ์‹คํ—˜/๋ฐฐํฌ/๋กค๋ฐฑ์„ ์šฉ์ดํ•˜๊ฒŒ ํ•˜๋Š” ์ฒด๊ณ„
์ธํ„ฐํŽ˜์ด์Šค Web-based AI Platform, Model Repository ๋ธŒ๋ผ์šฐ์ € ๊ธฐ๋ฐ˜์˜ ํ•™์Šต ํ™˜๊ฒฝ ๋ฐ ๋ชจ๋ธ์˜ ์ €์žฅ/์žฌ์‚ฌ์šฉ์„ ์œ„ํ•œ ์ €์žฅ์†Œ

๐Ÿ”ท 8. ๋น„์šฉ ์ตœ์ ํ™” ๋ฐ ์šด์˜ ๋„๊ตฌ

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
๋น„์šฉ ๋ถ„์„ Usage-based Billing, TCO Optimization, Cost Reporting ์‚ฌ์šฉ๋Ÿ‰ ๊ธฐ๋ฐ˜ ๊ณผ๊ธˆ ๋ฐ ์ธํ”„๋ผ ์šด์˜์˜ ์ด ์†Œ์œ  ๋น„์šฉ ์ ˆ๊ฐ ์ „๋žต
์ธ์Šคํ„ด์Šค ์ „๋žต Reserved Instance Optimization, Spot Instance ์žฅ๊ธฐ ์˜ˆ์•ฝ ์ธ์Šคํ„ด์Šค ๋˜๋Š” ์œ ํœด ์ž์› ๊ธฐ๋ฐ˜์˜ ์ŠคํŒŸ ์ธ์Šคํ„ด์Šค๋ฅผ ํ†ตํ•œ ๋น„์šฉ ์ ˆ๊ฐ
์„ฑ๋Šฅ ๋ถ„์„ GPU Utilization Metrics, Bottleneck Analysis GPU ์ž์›์˜ ์‚ฌ์šฉ๋ฅ , ๋ณ‘๋ชฉ ์ง€์  ๋“ฑ์„ ์‹œ๊ฐํ™”ํ•˜์—ฌ ์ตœ์ ํ™” ๊ธฐ๋ฐ˜ ๋งˆ๋ จ
์ด์ƒ ํƒ์ง€ AI-based Anomaly Detection ๋น„์ •์ƒ์ ์ธ GPU ์‚ฌ์šฉ๋Ÿ‰ ๋˜๋Š” ์š”๊ธˆ ๊ธ‰๋“ฑ ํŒจํ„ด์„ ๊ฐ์ง€ํ•˜์—ฌ ๊ฒฝ๊ณ  ๋ฐ ์ œ์–ด
์žฌ๋ฐฐ์น˜ Auto-resizing / Reallocation ์ž์› ์žฌ์กฐ์ • ๋ฐ ๋ฆฌ์‚ฌ์ด์ง•์„ ํ†ตํ•ด ํด๋Ÿฌ์Šคํ„ฐ ํ™œ์šฉ๋ฅ ์„ ๋†’์ด๋Š” ์šด์˜ ์ „๋žต
์Šค์ผ€์ผ๋ง AutoScaler (GPU-aware) GPU ์‚ฌ์šฉ๋Ÿ‰ ๋˜๋Š” ์›Œํฌ๋กœ๋“œ ์ˆ˜์š”์— ๋”ฐ๋ผ Pod/๋…ธ๋“œ ์ˆ˜๋ฅผ ์ž๋™ ์กฐ์ ˆํ•˜๋Š” ๊ธฐ๋Šฅ
๊ด€์ฐฐ์„ฑ ๋„๊ตฌ Prometheus, Grafana, AlertManager, OpenTelemetry ์‹ค์‹œ๊ฐ„ ๋ฉ”ํŠธ๋ฆญ ์ˆ˜์ง‘ ๋ฐ ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ ๋Œ€ํ‘œ์ ์ธ ์˜คํ”ˆ์†Œ์Šค ๊ด€์ฐฐ์„ฑ ๋„๊ตฌ

๐Ÿ”ท 9. GPU ์ปดํŒŒ์ผ๋Ÿฌ ๋ฐ ๋””๋ฒ„๊น… ๋„๊ตฌ

๋ถ„๋ฅ˜ ํ‚ค์›Œ๋“œ ์„ค๋ช…
์ปดํŒŒ์ผ๋Ÿฌ ROCm, AOCC, PGI Compiler, OpenACC AMD ๋ฐ NVIDIA์šฉ ๊ณ ์„ฑ๋Šฅ GPU ์ปดํŒŒ์ผ๋Ÿฌ ๋ฐ ์˜คํ”„๋กœ๋“œ ๊ธฐ์ˆ 
๋””๋ฒ„๊น… nvidia-smi, cuda-gdb, nsys, nvprof GPU ์ƒํƒœ ํ™•์ธ, ๋””๋ฒ„๊น…, ์„ฑ๋Šฅ ๋ถ„์„์„ ์œ„ํ•œ NVIDIA ๋„๊ตฌ ๋ชจ์Œ

 

 

===============================================================

 

 

๐Ÿ”ท 1. HPC ์ธํ”„๋ผ ๋ฐ ํ•˜๋“œ์›จ์–ด ๊ตฌ์กฐ

 

  • Accelerator: GPU, NPU, Gaudi, MI250 ๋“ฑ ํŠน์ˆ˜ ์—ฐ์‚ฐ ์žฅ์น˜
  • HBM (High Bandwidth Memory): GPU ๋‚ด ๊ณ ๋Œ€์—ญํญ ๋ฉ”๋ชจ๋ฆฌ
  • PCIe / Passthrough: PCIe ์Šฌ๋กฏ GPU ์žฅ์ฐฉ, VM ์ง์ ‘ ํ• ๋‹น
  • Baremetal / Service VM: ๋ฌผ๋ฆฌ ์„œ๋ฒ„ ์ง์ ‘ ์‹คํ–‰ / VM ๊ธฐ๋ฐ˜
  • NUMA ๊ตฌ์กฐ: CPU-GPU ๊ฐ„ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ง€์—ฐ ์ตœ์ ํ™” ํ•„์š”
  • Power Budget / TDP / Cooling: ์ „๋ ฅ ๋ฐ ๋ฐœ์—ด ์„ค๊ณ„ ๊ณ ๋ ค ์š”์†Œ
  • Vertical Integration: HW~SW๊นŒ์ง€ ์ˆ˜์ง ์ตœ์ ํ™” ๊ตฌ์กฐ
  • MIG (Multi-Instance GPU): A100/H100์—์„œ GPU ๋…ผ๋ฆฌ ๋ถ„ํ• 
  • ๋ณ‘๋ ฌ ํŒŒ์ผ ์‹œ์Šคํ…œ: Lustre, BeeGFS, Spectrum Scale

 

 

๐Ÿ”ท 2. ์Šค์ผ€์ค„๋ง ๋ฐ ๋ถ„์‚ฐ ํ•™์Šต ์ „๋žต

 

  • Collective Communication / Ring Topology: GPU ๊ฐ„ ํ†ต์‹  ๋ฐฉ์‹
  • Distributed Training: GPU ๋ถ„์‚ฐ ํ•™์Šต ๋ฐ ํŒŒ๋ผ๋ฏธํ„ฐ ๋™๊ธฐํ™”
  • Mixed Precision Training: FP16/BF16 ํ˜ผํ•ฉ ํ•™์Šต ์ตœ์ ํ™”
  • Gang Scheduling / Elastic Scheduling: ๋™์‹œ ์‹คํ–‰/์„ ์  ์Šค์ผ€์ค„๋ง
  • Imbalance Detection: ์ž์› ์‚ฌ์šฉ๋Ÿ‰ ๋ถˆ๊ท ํ˜• ํƒ์ง€

 

 

๐Ÿ”ท 3. ๋„คํŠธ์›Œํ‚น ๋ฐ GPU ํŒจ๋ธŒ๋ฆญ

 

  • NVLink / GPU Fabric: GPU ๊ฐ„ ๊ณ ์† ์ธํ„ฐ์ปค๋„ฅํŠธ
  • RDMA / GPUDirect RDMA: GPU ๋ฉ”๋ชจ๋ฆฌ ์ง์ ‘ ์ „์†ก
  • SR-IOV / VF: ๊ฐ€์ƒ NIC ์ƒ์„ฑ ๊ธฐ์ˆ 
  • Link Aggregation: NIC ๋ฌถ๊ธฐ ํ†ตํ•œ ๋Œ€์—ญํญ ์ฆ๊ฐ€
  • In-Network Computing: ์Šค์œ„์น˜์—์„œ ๊ณ„์‚ฐ ์ˆ˜ํ–‰
  • Leaf-Spine / OVS/OVN: ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ
  • SmartNIC / UCIe / Liquid Cooling: ์ตœ์‹  ํ•˜๋“œ์›จ์–ด ๊ธฐ์ˆ 

 

 

๐Ÿ”ท 4. ๊ฐ€์ƒํ™” ๋ฐ ํด๋Ÿฌ์Šคํ„ฐ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜

 

  • vGPU / GPU Passthrough: GPU ๊ฐ€์ƒํ™” ๋ฐฉ์‹
  • Resource Isolation: ํ…Œ๋„ŒํŠธ๋ณ„ ๋ฆฌ์†Œ์Šค ๋ถ„๋ฆฌ
  • K8s Operator / Orchestration: ์ž์› ์ž๋™ ์šด์˜
  • K8s Device Plugin: GPU/FPGA ๋“ฑ ๋“ฑ๋ก ๋ชจ๋“ˆ
  • ์ด๊ธฐ์ข… ํด๋Ÿฌ์Šคํ„ฐ: GPU + NPU + DPU ํ†ตํ•ฉ ์šด์˜

 

 

๐Ÿ”ท 5. AI ์ธํ”„๋ผ ๋ฐ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ์ „๋ ฅ ์„ค๊ณ„

 

  • AIDC ๊ตฌ์กฐ: ๊ณ ๋ฐ€๋„ ์—ฐ์‚ฐ ํŠนํ™” ๋ฐ์ดํ„ฐ์„ผํ„ฐ
  • 35kW ๋ž™ ์„ค๊ณ„: Worst-case ์ „๋ ฅ ์„ค๊ณ„
  • Inductor / PMIC: ์ „๋ ฅ ์•ˆ์ •ํ™” ๋ถ€ํ’ˆ
  • GPU Telemetry: ์˜จ๋„, ํŒฌ์†๋„, ์ „๋ ฅ ์‹ค์‹œ๊ฐ„ ์ˆ˜์ง‘

 

 

๐Ÿ”ท 6. ํด๋Ÿฌ์Šคํ„ฐ ๊ด€๋ฆฌ ๋ฐ ์šด์˜ ์ „๋žต

 

  • GPU Supply Chain Management: ๋ฒค๋” ์ข…์†๋„ ์™„ํ™” ์ „๋žต
  • Cluster Manager: ์ƒํƒœ ๋ชจ๋‹ˆํ„ฐ๋ง / ๋ฐฐํฌ ๋„๊ตฌ
  • Utilization Optimization / QoS: ์œ ํœด ์ž์› ์ตœ์  ํšŒ์ˆ˜
  • Inference Scheduling: ์ถ”๋ก  ์š”์ฒญ ๋ถ„์‚ฐ ์ฒ˜๋ฆฌ
  • Node Affinity / Anti-Affinity: ๋…ธ๋“œ ์ •์ฑ… ์„ค์ •

 

 

๐Ÿ”ท 7. AI ํ”Œ๋žซํผ ๋ฐ ์›Œํฌํ”Œ๋กœ์šฐ ์ž๋™ํ™”

 

  • GPUaaS / Hybrid Cloud: GPU ์„œ๋น„์Šค ํ”Œ๋žซํผ ํ˜•ํƒœ
  • Model Training / Inference: ํ•™์Šต-์ถ”๋ก  ์ „์ฒด ํ๋ฆ„
  • AIOps / MLOps: ์ž๋™ํ™”๋œ ํŒŒ์ดํ”„๋ผ์ธ
  • Vector Indexing: ๋ฒกํ„ฐ ๊ธฐ๋ฐ˜ ์ž„๋ฒ ๋”ฉ ๊ฒ€์ƒ‰
  • Checkpointing: ์ค‘๋‹จ ๋Œ€๋น„ ์ƒํƒœ ์ €์žฅ
  • Model Versioning: ์‹คํ—˜/๋กค๋ฐฑ ์œ„ํ•œ ๋ฒ„์ „ ๊ด€๋ฆฌ
  • Web-based Platform: ์›น ๊ธฐ๋ฐ˜ ๋ชจ๋ธ ์ €์žฅ์†Œ

 

 

๐Ÿ”ท 8. ๋น„์šฉ ์ตœ์ ํ™” ๋ฐ ์šด์˜ ๋„๊ตฌ

 

  • Usage-based Billing / TCO Optimization
  • Reserved / Spot Instance: ์ธ์Šคํ„ด์Šค ๋น„์šฉ ์ „๋žต
  • Bottleneck Analysis: ๋ณ‘๋ชฉ ์ง€์  ๋ถ„์„
  • AI-based Anomaly Detection: ์ด์ƒ ์ง•ํ›„ ํƒ์ง€
  • Auto-resizing / AutoScaler (GPU-aware): ์ž์› ์žฌ์กฐ์ •
  • Prometheus / Grafana / OpenTelemetry: ๊ด€์ฐฐ์„ฑ ๋„๊ตฌ

 

 

๐Ÿ”ท 9. GPU ์ปดํŒŒ์ผ๋Ÿฌ ๋ฐ ๋””๋ฒ„๊น… ๋„๊ตฌ

 

  • ROCm / AOCC / OpenACC: GPU์šฉ ์ปดํŒŒ์ผ๋Ÿฌ
  • nvidia-smi / cuda-gdb / nsys / nvprof: ๋””๋ฒ„๊น… ๋ฐ ์„ฑ๋Šฅ ๋ถ„์„ ๋„๊ตฌ

 

728x90