[HPC/GPU Cluster Operations Zero to Hero, Part 40] Observability Setup: Building an Integrated Prometheus, Grafana, and Loki Environment

ygtoken 2025. 8. 12. 10:56

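Why Observability Matters

In HPC and GPU cluster operations, **observability** goes beyond simple monitoring: it is the ability to infer a system's internal state from the data it emits.
Performance tuning and incident response depend on collecting and analyzing, in one place, not only CPU, GPU, network, and storage resources but also job scheduler (Slurm) state, application performance metrics, logs, and events.

The most widely used open-source stack for this is Prometheus + Grafana + Loki.

Prometheus: time-series metric collection and storage
Grafana: visualization dashboards
Loki: log aggregation and search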

Architecture Overview

[Exporter/Agent] → [Prometheus] → [Grafana]
          ↘
           → [Loki] → [Grafana Explore]

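Exporter/Agent: exposes metrics from GPUs (DCGM Exporter), nodes (Node Exporter), Slurm (Slurm Exporter), and so on
Prometheus: scrapes and stores metrics and integrates with Alertmanager
Loki: stores application and system logs
Grafana: unified visualization of metrics and logs

Prometheus Configuration

1. Installation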

wget https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
tar xvf prometheus-*.tar.gz

2. Example configuration (prometheus.yml)

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
  - job_name: 'gpu'
    static_configs:
      - targets: ['node1:9400'] # DCGM Exporter
  - job_name: 'slurm'
    static_configs:
      - targets: ['slurm-exporter:8080']

3. Key exporters

Node Exporter: CPU, memory, disk, and network metrics
DCGM Exporter: GPU utilization, temperature, and power
Slurm Exporter: job queue state and partition state
Blackbox Exporter: service availability checks
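
Every exporter listed above serves its metrics as plain text over HTTP, and Prometheus scrapes that text on each interval. The following is a minimal sketch of what that payload looks like and how it breaks down into name, labels, and value. The metric names follow Node Exporter and DCGM Exporter conventions, but the sample values and label sets are made up for illustration, and parse_metrics is a hypothetical helper rather than part of any Prometheus library.

```python
import re

# Sample scrape output. Metric names follow exporter conventions;
# the label sets and values are invented for illustration.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.52
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node1"} 65
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node1"} 71
"""

# metric_name{optional="labels"} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

This simplified parser skips details such as optional timestamps and escaped label values, so it is meant for reading exporter output by hand, not for replacing a real client library.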

Loki Configuration

1. Installation

wget https://github.com/grafana/loki/releases/download/v2.9.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip

2. Basic configuration (loki-config.yaml)

auth_enabled: false
server:
  http_listen_port: 3100
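ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1   # single-instance setup; the default expects multiple ingesters
  chunk_idle_period: 5m
  max_chunk_age: 1h
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h           # boltdb-shipper requires a 24h index period
      chunks:
        prefix: chunk_
storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/index   # example path (assumption); use persistent storage in production
    cache_location: /tmp/loki/index_cache     # example path (assumption)
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks               # example path (assumption)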

3. Log collection with Promtail

server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: varlogs
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
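
Promtail delivers log lines to the clients URL configured above by POSTing JSON to Loki's push endpoint. As a rough sketch of what travels over that connection, the code below builds the same kind of payload by hand: a list of streams, each with a label set and [nanosecond timestamp, line] pairs. The host label, the log lines, and the helper names build_push_payload and push are assumptions made for illustration.

```python
import json
import time
import urllib.request

LOKI_PUSH_URL = "http://loki:3100/loki/api/v1/push"  # same URL as the Promtail config


def build_push_payload(labels, lines, ts_ns=None):
    """Build the JSON body for Loki's push API from a label set and log lines."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    # Adding 1 ns per line keeps the entries in order within the stream.
    values = [[str(ts_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def push(payload, url=LOKI_PUSH_URL):
    """Send the payload to Loki; needs a reachable Loki instance."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical label set and log lines.
    payload = build_push_payload(
        {"job": "varlogs", "host": "node1"},
        ["slurmd: job 1234 started", "slurmd: job 1234 completed"],
    )
    print(json.dumps(payload, indent=2))
```

Labels such as job are what Grafana later filters on, so keeping the label set small and stable (job, host, and similar) tends to work better than putting high-cardinality values like job IDs into labels.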

Grafana Configuration

1. Installation

wget https://dl.grafana.com/oss/release/grafana-11.0.0.linux-amd64.tar.gz
tar -zxvf grafana-*.tar.gz

2. Adding data sources

Prometheus URL: http://prometheus:9090
Loki URL: http://loki:3100

3. Example dashboards

GPU Monitoring Dashboard: GPU utilization, memory, power
Cluster Health Dashboard: node status, Slurm queue status
Log Search: search logs per Pod or job with Grafana Explore against the Loki data source
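
Behind each dashboard panel, Grafana sends a PromQL or LogQL query to the data source URLs configured in step 2. The sketch below builds those HTTP API requests directly, which can help when checking a query outside Grafana. The helper names are hypothetical, and the example queries assume the job label from the Promtail config and the DCGM_FI_DEV_GPU_UTIL metric from DCGM Exporter.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://prometheus:9090"  # data source URLs from step 2
LOKI_URL = "http://loki:3100"


def prometheus_instant_query_url(expr, base=PROMETHEUS_URL):
    """URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def loki_range_query_url(logql, start_ns, end_ns, limit=100, base=LOKI_URL):
    """URL for a Loki range query (GET /loki/api/v1/query_range)."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # GPU panel: average utilization per host (label name is an assumption).
    print(prometheus_instant_query_url("avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"))
    # Cluster health panel: scrape targets that are currently down.
    print(prometheus_instant_query_url("up == 0"))
    # Log search: lines containing "error" from the varlogs job.
    print(loki_range_query_url('{job="varlogs"} |= "error"',
                               1_700_000_000_000_000_000,
                               1_700_000_600_000_000_000))
```

The printed URLs can be fetched with curl or a browser from any host that can reach the Prometheus and Loki services.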

Kubernetes-Based Integrated Deployment Example

With a Helm chart, Prometheus, Grafana, and Loki can be installed in one step:

helm repo add grafana https://grafana.github.io/helm-charts
helm install monitoring grafana/loki-stack \
  --set grafana.enabled=true \
  --set prometheus.enabled=true

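Operational Considerations

Data retention: plan storage capacity and set retention periods for both Prometheus and Loki. Prometheus takes the --storage.tsdb.retention.time and --storage.tsdb.retention.size flags; Loki uses retention_period under limits_config, enforced by the compactor when retention_enabled is true.
Alerting: define alert rules for conditions such as GPU temperature, job failure rate, and nodes going offline
Performance optimization:
- Prometheus: use remote_write to forward data to a long-term storage system
- Loki: tune compression and indexing
Security: integrate Grafana with LDAP/SAML and serve it over HTTPS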
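
To make the retention point concrete, the Prometheus documentation gives a rule of thumb for local storage: disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, with a sample averaging about 1 to 2 bytes. The sketch below applies that formula. The cluster size and series count are hypothetical numbers chosen for the example, not measurements.

```python
def prometheus_disk_bytes(active_series, scrape_interval_s, retention_days,
                          bytes_per_sample=2.0):
    """Rough Prometheus TSDB size estimate.

    needed_disk_space = retention_seconds * samples_per_second * bytes_per_sample
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical cluster: 100 nodes exposing about 1,500 series each,
    # scraped every 15s (the interval in prometheus.yml), kept for 30 days.
    series = 100 * 1_500
    estimate = prometheus_disk_bytes(series, scrape_interval_s=15, retention_days=30)
    print(f"{estimate / 1e9:.1f} GB")
```

With these inputs the ingest rate is 10,000 samples per second, which over 30 days comes to about 52 GB at 2 bytes per sample. The estimate scales linearly, so halving the scrape interval or doubling the retention period doubles the disk requirement.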

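Industry Use Cases

AI research labs: GPU utilization monitoring plus real-time analysis of training logs
HPC centers: unified visualization of Slurm queue state and node resource usage
Cloud GPU services: per-customer resource usage reports

Pros and Cons

Pros

Metrics and logs can be analyzed together in a single dashboard
Open-source foundation that scales and extends well
Wide range of available exporters

Cons

Initial setup and tuning are complex
Long-term data retention puts pressure on storage
In high-performance environments, the collection and storage nodes themselves need substantial resources

Wrapping Up

The Prometheus + Grafana + Loki stack is a powerful way to improve **observability** in HPC/GPU clusters.
Collecting and analyzing metrics and logs together makes it much faster to find the root cause of a failure and much easier to tune performance.
To keep an HPC environment stable, operators should set up exporters, dashboards, and alerting policies systematically.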
