Part 8: Monitoring and Logging for the DataOps System

Author: thinh04, 21/03/2026

Configuring Prometheus and Grafana to Monitor the Kubernetes Cluster

We will deploy the Prometheus Operator and Grafana with Helm to monitor the whole cluster, including the pods of the DataOps pipeline.

The goal is to collect resource metrics (CPU, RAM) for nodes and pods, especially the pods running training jobs and the inference service.

First, add the Prometheus Community repository and install the kube-prometheus-stack chart, which bundles the Prometheus Operator.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

Helm deploys the following components: the Prometheus Operator, the Prometheus server, Alertmanager, Grafana, kube-state-metrics, and node-exporter.

Verify by checking the status of the pods in the `monitoring` namespace.

kubectl get pods -n monitoring

Expected result: all pods (prometheus, grafana, alertmanager, kube-state-metrics) are in the "Running" state.
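The same check can be scripted. The sketch below is a minimal example, assuming `kubectl` is on the PATH and has access to the cluster. The helper names `pods_not_running` and `fetch_pods` are hypothetical. It parses the JSON form of the pod list and reports every pod whose phase is not "Running".

```python
import json
import subprocess


def pods_not_running(pod_list: dict) -> list[str]:
    """Return the names of pods whose status.phase is not "Running"."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") != "Running"
    ]


def fetch_pods(namespace: str = "monitoring") -> dict:
    """Run `kubectl get pods -o json` and parse the result (needs cluster access)."""
    completed = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)
```

Calling `pods_not_running(fetch_pods())` returns an empty list when every pod in the namespace is Running.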

Configuring Scraping for the Training and Inference Pods

By default, the Prometheus instance installed by this chart only scrapes targets described by ServiceMonitor and PodMonitor resources that match its selectors. We need to configure it to discover and scrape the DataOps pods automatically.

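kube-prometheus-stack generates the Prometheus configuration from the `Prometheus` custom resource, so hand edits to the generated configuration are overwritten by the Operator. Extra scrape jobs are supplied instead through the chart value `prometheus.prometheusSpec.additionalScrapeConfigs`. The jobs below keep only pods labelled `app.kubernetes.io/component: training` or `inference` that expose a container port named `metrics`.

Create a Helm values file, for example `values-dataops.yaml` (the file name is arbitrary), with the following content:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: 'dataops-training'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            action: keep
            regex: training
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: metrics
          # Copy namespace, pod name, and component onto the scraped series
          # so that later queries can filter on them.
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            target_label: app_kubernetes_io_component
      - job_name: 'dataops-inference'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            action: keep
            regex: inference
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: metrics
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            target_label: app_kubernetes_io_component

Apply the values with a Helm upgrade. The Operator regenerates the configuration and reloads Prometheus, so no manual pod restart is needed.

helm upgrade prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f values-dataops.yaml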
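To make the effect of the `keep` rules concrete, here is a small simulation of how such a rule filters discovered pods. It is a sketch, not Prometheus code. It assumes Python's `re.fullmatch` is a close enough stand-in for Prometheus's fully anchored RE2 matching, and the function names are hypothetical.

```python
import re


def keep(labels: dict[str, str], source_labels: list[str], regex: str,
         separator: str = ";") -> bool:
    """Mimic a relabel rule with `action: keep`.

    The source label values are joined with `separator`; the target survives
    only if the regex matches the whole joined string.
    """
    value = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, value) is not None


def kept_by_training_job(labels: dict[str, str]) -> bool:
    """Apply the two keep rules of the dataops-training job in order."""
    return (
        keep(labels, ["__meta_kubernetes_pod_label_app_kubernetes_io_component"], "training")
        and keep(labels, ["__meta_kubernetes_pod_container_port_name"], "metrics")
    )
```

Because the match is anchored, a pod labelled `training-v2` is dropped, and so is a training pod whose only port is named `http`.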

Verify: open the Targets page of Prometheus (port 9090) and check that the new jobs are in the "UP" state.

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

Open `http://localhost:9090/targets` in a browser. You should see two new jobs, `dataops-training` and `dataops-inference`. They only list targets once matching pods are running.
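The Targets page is backed by the Prometheus HTTP API endpoint `/api/v1/targets`, so the check can also be done from a script. The sketch below assumes the port-forward above is active. The helper names are hypothetical, and only the `labels.job` and `health` fields of each active target are used.

```python
import json
import urllib.request
from collections import defaultdict


def summarize_targets(response: dict) -> dict[str, dict[str, int]]:
    """Count active targets per job and health state ("up", "down", "unknown")."""
    summary: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for target in response["data"]["activeTargets"]:
        job = target["labels"].get("job", "<no job>")
        summary[job][target["health"]] += 1
    return {job: dict(counts) for job, counts in summary.items()}


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Download the target list from a reachable Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)
```

`summarize_targets(fetch_targets())` should show at least one `up` target for both `dataops-training` and `dataops-inference` once the pods are running.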

Loading a Dedicated Grafana Dashboard for DataOps

We need a dashboard to visualize the metrics just collected.

Create a file `grafana-dashboard.json` to import into Grafana. The dashboard shows GPU usage (if available), memory usage, training loss (if it is exposed as a metric), and inference latency.

Assuming the dashboard JSON has already been prepared (based on an MLflow/Prometheus template), import it into Grafana through the API or the UI.

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Open `http://localhost:3000`, log in (user: admin, password: prom-operator, the chart default), go to the "Import" menu, and upload the JSON file. If the port-forward fails, list the services with `kubectl get svc -n monitoring`; with a release named `prometheus`, the Grafana service is normally `prometheus-grafana`.

Verify: after the import, open the new dashboard and make sure the panels show live data from the DataOps pods.
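For the API route, Grafana accepts a dashboard through `POST /api/dashboards/db`. The sketch below only builds the request, so it can be inspected before anything is sent. It assumes a plain dashboard JSON model (not an export with `__inputs` placeholders) and the default admin credentials mentioned above. The function name is hypothetical.

```python
import base64
import json
import urllib.request


def build_import_request(dashboard: dict,
                         base_url: str = "http://localhost:3000",
                         user: str = "admin",
                         password: str = "prom-operator") -> urllib.request.Request:
    """Build (but do not send) the HTTP request that uploads a dashboard."""
    # Grafana assigns the numeric id itself, so it is cleared before upload.
    payload = {"dashboard": {**dashboard, "id": None}, "overwrite": True}
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

To send it, load the file with `json.load(open("grafana-dashboard.json"))`, pass the result to `build_import_request`, and hand the request to `urllib.request.urlopen`.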

Collecting Logs from the Training and Inference Containers

Metrics only show performance, while logs help debug specific errors. In Kubernetes, containers write their logs to stdout/stderr, and the container runtime stores them as files on each node.

One option is the EFK stack (Elasticsearch, Fluentd, Kibana). A simpler one, used here, is Loki with Promtail, which integrates with the Prometheus/Grafana setup already in place.

Deploy Loki and Promtail with Helm.

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack --namespace monitoring --create-namespace --set promtail.enabled=true --set grafana.enabled=false

Helm deploys Loki (log storage) and Promtail (the log collector, which runs on every node).

Configuring Promtail to Filter Logs by Label

We want Promtail to collect logs only from pods that carry the DataOps labels, to reduce the load on log storage.

Edit the Promtail configuration in the `monitoring` namespace. Depending on the chart version, it may be stored in a Secret named `loki-promtail` rather than a ConfigMap.

kubectl edit configmap loki-promtail -n monitoring

Set `scrape_configs` in the `config` section as follows.

config: |
  positions:
    filename: /var/lib/promtail/positions.yaml
  clients:
    - url: http://loki:3100/loki/api/v1/push
  scrape_configs:
    - job_name: dataops-logs
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # Keep only the DataOps training and inference pods.
        - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
          action: keep
          regex: training|inference
        # Let each Promtail instance tail only the pods on its own node.
        - source_labels: [__meta_kubernetes_pod_node_name]
          target_label: __host__
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - source_labels: [__meta_kubernetes_pod_container_name]
          target_label: container
        - source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
        # Log files live under /var/log/pods/<namespace>_<pod>_<uid>/<container>/.
        - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
          separator: /
          target_label: __path__
          replacement: /var/log/pods/*$1/*.log
        # Copy every pod label (for example app_kubernetes_io_component) onto the log stream.
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)

Save the change and restart the Promtail pods.

kubectl delete pod -n monitoring -l app.kubernetes.io/name=promtail
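Relabel rules without an explicit `action` use the default `replace` action, which is what sets `namespace`, `pod`, and `__path__` in the Promtail configuration above. The sketch below imitates that action in a simplified form. The `__path__` example takes the pod UID and container name as sources, which is the pattern from Promtail's stock Kubernetes configuration and an assumption about the configuration actually in use. The function name is hypothetical.

```python
import re


def relabel_replace(labels: dict[str, str], source_labels: list[str],
                    target_label: str, replacement: str = "$1",
                    regex: str = "(.*)", separator: str = ";") -> dict[str, str]:
    """Mimic a relabel rule with `action: replace` (simplified)."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value, flags=re.DOTALL)
    if match is None:
        return dict(labels)  # no match: the label set is left unchanged
    # Translate $1 / ${1} references into Python's \g<1> form before expanding.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    return {**labels, target_label: match.expand(template)}


pod = {
    "__meta_kubernetes_pod_uid": "1234-abcd",
    "__meta_kubernetes_pod_container_name": "trainer",
}
with_path = relabel_replace(
    pod,
    ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"],
    "__path__",
    replacement="/var/log/pods/*$1/*.log",
    separator="/",
)
print(with_path["__path__"])  # /var/log/pods/*1234-abcd/trainer/*.log
```

The resulting glob matches the per-container log directory that the kubelet creates for the pod, which is why the UID and container name are joined with `/`.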

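Verify: check that logs are reaching Loki, either from Grafana or by querying the Loki HTTP API directly (for example with `logcli`). Because the chart was installed with `grafana.enabled=false`, first add a Loki data source to the existing Grafana with the URL `http://loki:3100`.

kubectl port-forward -n monitoring svc/loki 3100:3100

In Grafana, open Explore, select the Loki data source, and run the query `{namespace="your-namespace", app_kubernetes_io_component="training"}`. The port-forward above exposes the Loki API at `http://localhost:3100` for direct queries.

Expected result: the most recent log lines from training or inference appear.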
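For a direct check without Grafana, Loki exposes range queries at `/loki/api/v1/query_range`. The sketch below builds such a query URL and extracts the log lines from a response. It assumes the Loki HTTP API is reachable at `http://localhost:3100`. The helper names and the 15-minute default window are illustrative choices.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Optional


def build_query_url(logql: str, base_url: str = "http://localhost:3100",
                    minutes: int = 15, limit: int = 20,
                    now_ns: Optional[int] = None) -> str:
    """Build a query_range URL covering the last `minutes` minutes."""
    end = now_ns if now_ns is not None else time.time_ns()
    start = end - minutes * 60 * 1_000_000_000  # Loki accepts nanosecond epochs
    params = {"query": logql, "start": start, "end": end,
              "limit": limit, "direction": "backward"}
    return f"{base_url}/loki/api/v1/query_range?{urllib.parse.urlencode(params)}"


def extract_lines(body: dict) -> list[str]:
    """Flatten the log lines of all returned streams."""
    return [entry[1] for stream in body["data"]["result"] for entry in stream["values"]]


def fetch_lines(url: str) -> list[str]:
    """Run the query against a reachable Loki instance."""
    with urllib.request.urlopen(url) as resp:
        return extract_lines(json.load(resp))
```

For example, `fetch_lines(build_query_url('{app_kubernetes_io_component="training"}'))` returns the newest training log lines first.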

Integrating Logs into the Grafana Dashboard

To view logs alongside metrics (for example, to read the matching error logs when latency rises), add a "Logs" panel to the Grafana dashboard created in the previous section.

In the Grafana dashboard editor, add a new panel, choose the "Logs" visualization, and select Loki as the data source.

Use the query `{namespace="$namespace", pod="$pod"}`, and define `$namespace` and `$pod` as dashboard variables so the panel can be filtered dynamically.

Verify: select a specific pod in the dropdown. The Logs panel shows that pod's logs for the same time range as the metric graphs.

Setting Up Alerts for Training Failures and Downtime

Alerts help the operations team react quickly when the pipeline fails. We use Prometheus alert rules together with Alertmanager to send notifications (email/Slack).

We configure alert rules for two scenarios: a training job crash and inference service downtime (unavailable).

Create the file `dataops-alerts.yaml` to define the alert rules.

cat > /tmp/dataops-alerts.yaml <<'EOF'
groups:
  - name: dataops-training
    rules:
      # The alert name and the left-hand side of this expression were lost in
      # the source text; they are reconstructed from the description below.
      # The pod name pattern is an assumption, adjust it to your training pods.
      - alert: TrainingJobCrash
        expr: increase(kube_pod_container_status_restarts_total{namespace="ml", pod=~".*training.*"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Training job {{ $labels.pod }} has crashed"
          description: "Training pod {{ $labels.pod }} restarted in the last 5 minutes (restart increase: {{ $value }})."
  - name: dataops-inference
    rules:
      - alert: InferenceServiceDown
        # absent() also fires when the pods are gone and no `up` series exists.
        expr: absent(up{job="dataops-inference"} == 1)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Inference service unavailable"
          description: "All inference pods in namespace ml are down."
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, sum(rate(inference_request_duration_seconds_bucket{job="dataops-inference"}[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference latency is abnormally high"
          description: "P95 latency of the inference service has exceeded 500 ms for 5 minutes."
EOF

kube-prometheus-stack loads alert rules from `PrometheusRule` custom resources rather than from a ConfigMap, and the Prometheus configuration generated by the Operator should not be edited by hand. By default the chart's `ruleSelector` picks up PrometheusRule objects labelled `release: <Helm release name>`, which is `release: prometheus` for this installation. You can confirm the selector with:

kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.ruleSelector}'

Wrap the rule groups in a PrometheusRule resource that carries this label, then apply it:

cat > /tmp/dataops-prometheusrule.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dataops-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
EOF
# Indent the groups from dataops-alerts.yaml by two spaces so they nest under spec.
sed 's/^/  /' /tmp/dataops-alerts.yaml >> /tmp/dataops-prometheusrule.yaml
kubectl apply -f /tmp/dataops-prometheusrule.yaml

The Operator detects the new resource and reloads Prometheus on its own, so a restart is normally not required. If the rules have not appeared after a few minutes, restarting the Prometheus pod forces a reload:

kubectl delete pod -n monitoring -l app.kubernetes.io/name=prometheus

Verify: open the Prometheus UI (`http://localhost:9090`) and go to the "Alerts" tab. Alerts appear as "Pending" or "Firing" once their conditions are met. To test, take the inference pods down:

kubectl delete pod -n ml -l app.kubernetes.io/component=inference

Wait about 2 minutes, then check the Alerts tab: "InferenceServiceDown" should move to "Firing". If a Deployment recreates the pods right away, the outage may be too short for the alert to fire; scaling the Deployment to zero replicas gives a sustained outage for the test.
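The wait is a consequence of the `for:` clause: an alert stays "Pending" until its condition has held continuously for that duration, and only then becomes "Firing". The sketch below simulates this state machine for a sequence of rule evaluations. The 30-second evaluation interval is an assumption, and the function name is hypothetical.

```python
def alert_states(condition_true: list[bool], eval_interval_s: int,
                 for_s: int) -> list[str]:
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_true):
        now = i * eval_interval_s
        if not is_true:
            active_since = None  # any gap resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Condition true on six consecutive evaluations, 30 s apart, with `for: 2m`.
print(alert_states([True] * 6, eval_interval_s=30, for_s=120))
# ['pending', 'pending', 'pending', 'pending', 'firing', 'firing']
```

A single evaluation where the condition is false resets the timer, which is why a short outage may never get past "Pending".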

Tracking Inference Performance (Latency, Throughput) in Real Time

To tune inference performance, monitor two key indicators: latency (response time) and throughput (requests per second).

We configure the inference application (assumed to be FastAPI/Flask on Kubernetes) to expose metrics in the Prometheus format.

In the Python application, add the `prometheus_client` library (install it in the Dockerfile) and expose a `/metrics` endpoint.

pip install prometheus_client

Code in `app.py` of the inference service:

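import os
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric definitions
REQUEST_COUNT = Counter('inference_requests_total', 'Total number of inference requests', ['model_name', 'status'])
REQUEST_LATENCY = Histogram('inference_request_duration_seconds', 'Latency of inference requests in seconds', ['model_name'])

# Serve /metrics on its own port (PROMETHEUS_PORT, default 9000).
# Call this once at application startup, outside the request path.
start_http_server(int(os.environ.get('PROMETHEUS_PORT', '9000')))

def predict(request_data):
    start_time = time.perf_counter()
    try:
        # Actual inference logic; `model` is assumed to be loaded elsewhere in app.py
        result = model.predict(request_data)
        status = "success"
    except Exception:
        # Count the failure; log the exception here in real code
        status = "error"
        result = None

    duration = time.perf_counter() - start_time
    REQUEST_COUNT.labels(model_name="resnet50", status=status).inc()
    REQUEST_LATENCY.labels(model_name="resnet50").observe(duration)
    return result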

Make sure the container exposes the metrics port (9000 in this example) under the port name `metrics`, and add the label `app.kubernetes.io/component: inference` to the pod template so that the scrape job configured in the scraping section picks it up.

In the inference Deployment, add the container port and the pod label:

spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/component: inference
    spec:
      containers:
        - name: inference
          image: your-registry/inference-app:latest
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 9000
              name: metrics
          env:
            - name: PROMETHEUS_PORT
              value: "9000"

Verify: after redeploying the inference pods, open Prometheus -> Graph and enter the query:

rate(inference_requests_total{namespace="ml", model_name="resnet50"}[5m])

Expected result: a graph of the number of requests per second.
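To see what `rate()` computes, the sketch below derives a per-second rate from raw counter samples, including the counter reset that happens when a pod restarts. It is a simplification: Prometheus additionally extrapolates to the edges of the 5-minute window, which this version skips. The function name is hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase over (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. pod restart): count from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


# 120 requests over 120 seconds
print(simple_rate([(0, 100), (60, 160), (120, 220)]))  # 1.0
```

With a reset in the middle, for example `[(0, 100), (60, 10), (120, 40)]`, the increase is counted as 10 + 30 = 40 requests rather than a negative number.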

Next, query the P95 latency:

histogram_quantile(0.95, sum(rate(inference_request_duration_seconds_bucket{namespace="ml", model_name="resnet50"}[5m])) by (le))

Expected result: the 95th-percentile request latency, in seconds. This means 95% of requests complete within this time; it is not an average.
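The value comes from interpolating inside the histogram buckets, not from individual request timings. The sketch below reproduces the core of that calculation for classic cumulative buckets. It is a simplified model of `histogram_quantile`, and the bucket counts in the example are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative buckets.

    `buckets` holds (upper bound `le`, cumulative count) pairs sorted by bound,
    ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket counts: 100 requests in total.
example = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p90 = histogram_quantile(0.90, example)  # falls inside the 0.25-0.5 s bucket
```

Because of the interpolation, the result is only as precise as the bucket boundaries: if latencies cluster around 500 ms, a bucket bound near that value is needed for the alert threshold to be meaningful.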

Add these panels to the Grafana dashboard created earlier for continuous monitoring.

Series: Building a DataOps Platform with DVC, MLflow, and Kubernetes for the AI Lifecycle
