Part 3: Configuring Prometheus to collect and monitor AI metrics

Author: ngohuy00 (21/03/2026)

Installing Prometheus and Grafana on Kubernetes

We will deploy Prometheus and Grafana into the cluster directly from Kubernetes manifests. This is the foundation for collecting data and displaying dashboards.

Create a configuration directory on your local machine or on the master node to hold the YAML files before applying them (writing under /etc usually requires root).

mkdir -p /etc/prometheus-operator && cd /etc/prometheus-operator

Result: the directory is created and you are now inside it, ready to save files.

Deploying Prometheus with the Prometheus Operator

Use a YAML file to deploy the Prometheus instance. The manifest creates the namespace, the RBAC objects, and a Prometheus custom resource. It assumes the Prometheus Operator and its CRDs (monitoring.coreos.com/v1) are already installed in the cluster: the Operator is what turns the Prometheus resource into running pods and watches ServiceMonitor objects.

Save the file at the full path: /etc/prometheus-operator/prometheus-deployment.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
  labels:
    app: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
  serviceMonitorNamespaceSelector: {}
  resources:
    requests:
      memory: 400Mi
      cpu: 100m
    limits:
      memory: 1000Mi
      cpu: 500m

Result: once the manifest is applied, Prometheus starts in the monitoring namespace and is ready to discover ServiceMonitor objects.

Deploying Grafana to display dashboards

Deploy Grafana so it can use Prometheus as a data source. Grafana can be installed with Helm or with plain YAML; we use YAML here for consistency with the previous step.

Save the file at the full path: /etc/prometheus-operator/grafana-deployment.yaml

apiVersion: v1
kind: Secret
metadata:
  name: grafana
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: admin
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: grafana
                  key: admin-user
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana
                  key: admin-password
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  type: LoadBalancer
  ports:
    - port: 3000
      targetPort: 3000
      protocol: TCP
  selector:
    app: grafana

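Result: Grafana runs and is exposed through a LoadBalancer Service (change the type to NodePort if your cluster has no load balancer), so you can reach it on port 3000 of the external IP.

Apply both manifests with kubectl apply -f prometheus-deployment.yaml and kubectl apply -f grafana-deployment.yaml, then verify the result by checking the pod status:

kubectl get pods -n monitoring

Expected result: the pods prometheus-k8s-0 and grafana-xxxxx are in the Running state.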
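
The same check can be scripted. The following is a minimal sketch, assuming kubectl is on the PATH and already pointed at the cluster; the helper names pods_not_running and fetch_pods are illustrative and not part of any tool. The sample payload is hand-written to mirror the shape of kubectl get pods -o json.

```python
import json
import subprocess


def pods_not_running(pod_list: dict) -> list[str]:
    """Return the names of pods whose phase is not 'Running'."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") != "Running"
    ]


def fetch_pods(namespace: str = "monitoring") -> dict:
    """Call kubectl and parse its JSON output (needs cluster access)."""
    completed = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


# Offline example: a hand-written payload, not real cluster output.
sample = {
    "items": [
        {"metadata": {"name": "prometheus-k8s-0"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "grafana-abc12"}, "status": {"phase": "Pending"}},
    ]
}
print(pods_not_running(sample))  # ['grafana-abc12']
```

Against a live cluster, pods_not_running(fetch_pods()) should return an empty list once both workloads are up.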

Configuring a ServiceMonitor to collect metrics from the AI service

The Prometheus Operator needs to know where to scrape metrics. We use the ServiceMonitor custom resource instead of editing Prometheus's prometheus.yml by hand.

Assume you already have an AI microservice (for example ai-inference-service) in the ai-platform namespace that exposes a /metrics endpoint on port 8080.

Creating a ServiceMonitor for the AI inference service

This file tells Prometheus: "Look in the ai-platform namespace, find Services labelled app: ai-inference, and scrape the port named metrics (8080 in this example)".

Save the file at the full path: /etc/prometheus-operator/ai-service-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-inference-monitor
  namespace: monitoring
  labels:
    app: ai-inference
spec:
  selector:
    matchLabels:
      app: ai-inference
  namespaceSelector:
    matchNames:
      - ai-platform
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s

Result: Prometheus discovers the Service automatically and starts scraping its metrics every 15 seconds.

For the ServiceMonitor to work, the Service in the ai-platform namespace must carry the label app: ai-inference and must name its port metrics. If it does not, update the Service YAML of the AI app.
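
The matching rule can be made concrete with a small sketch. It models only the parts of the selection logic used in this article (matchLabels, namespaceSelector, and named endpoint ports); the function name and the example Service are assumptions for illustration, and the real Operator supports more selector features.

```python
def service_matches_monitor(service: dict, monitor: dict) -> bool:
    """True if the Service would be selected by the ServiceMonitor.

    Simplified model: matchLabels, namespaceSelector (any / matchNames),
    and endpoint ports referenced by name.
    """
    meta = service["metadata"]
    spec = monitor["spec"]

    # Every label in matchLabels must be present with the same value.
    wanted = spec.get("selector", {}).get("matchLabels", {})
    labels = meta.get("labels", {})
    if any(labels.get(key) != value for key, value in wanted.items()):
        return False

    # Without a namespaceSelector, only the monitor's own namespace is searched.
    ns_selector = spec.get("namespaceSelector", {})
    if not ns_selector.get("any"):
        allowed = ns_selector.get("matchNames") or [
            monitor.get("metadata", {}).get("namespace")
        ]
        if meta.get("namespace") not in allowed:
            return False

    # At least one endpoint must reference a port name the Service defines.
    port_names = {port.get("name") for port in service["spec"].get("ports", [])}
    return any(ep.get("port") in port_names for ep in spec.get("endpoints", []))


monitor = {
    "metadata": {"name": "ai-inference-monitor", "namespace": "monitoring"},
    "spec": {
        "selector": {"matchLabels": {"app": "ai-inference"}},
        "namespaceSelector": {"matchNames": ["ai-platform"]},
        "endpoints": [{"port": "metrics", "path": "/metrics"}],
    },
}
service = {
    "metadata": {
        "name": "ai-inference-service",
        "namespace": "ai-platform",
        "labels": {"app": "ai-inference"},
    },
    "spec": {"ports": [{"name": "metrics", "port": 8080, "targetPort": 8080}]},
}
print(service_matches_monitor(service, monitor))  # True
```

Renaming the Service port to anything other than metrics makes the function return False, which is the most common reason a target never shows up.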

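Verify the result: open the Prometheus UI (port 9090) and go to "Status" -> "Targets". The Operator creates a Service named prometheus-operated for the instance, so forward that one (use prometheus-k8s instead if your installation, for example kube-prometheus, defines a Service with that name):

kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090

Expected result: the Targets page lists a scrape pool for ai-inference-monitor (shown as serviceMonitor/monitoring/ai-inference-monitor/0) with state UP.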
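
The Targets page is backed by the Prometheus HTTP API endpoint /api/v1/targets, so the same check can be automated. This is a sketch under the assumption that the port-forward above is active on localhost:9090; the helper names are illustrative and the sample payload is hand-written in the shape that endpoint returns.

```python
import json
import urllib.request


def targets_not_up(payload: dict) -> list[tuple[str, str]]:
    """Return (scrapePool, health) for every active target that is not 'up'."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapePool", ""), target.get("health", "unknown"))
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the live target list (needs the port-forward to be running)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


# Offline example: hand-written payload, not real Prometheus output.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapePool": "serviceMonitor/monitoring/ai-inference-monitor/0", "health": "up"},
            {"scrapePool": "serviceMonitor/monitoring/opa-gatekeeper-monitor/0", "health": "down"},
        ]
    },
}
print(targets_not_up(sample_payload))
```

An empty list from targets_not_up(fetch_targets()) means every discovered target is being scraped successfully.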

Defining custom metrics for AI events

Default infrastructure metrics (CPU, RAM) are not enough for AI Governance. We need business-level metrics such as the number of inference calls, latency, and error rate.

The AI application (a Python/Flask app in this example) has to expose these metrics in the Prometheus format. Below is sample Python code that does so; Prometheus then collects the metrics through the ServiceMonitor created earlier.

Sample code for exposing custom metrics in the AI application

The prometheus_client library provides the counter, histogram, and gauge types. Put this code in the main file of the AI application (for example app.py).

import time

from flask import Flask, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

app = Flask(__name__)

# 1. Counter: number of inference requests, split by model and status
inference_total = Counter(
    'ai_inference_total',
    'Total number of AI inference requests',
    ['model_name', 'status']
)

# 2. Histogram: inference latency (unit: seconds)
inference_latency = Histogram(
    'ai_inference_duration_seconds',
    'Duration of AI inference requests',
    ['model_name'],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# 3. Gauge: number of requests currently being processed.
#    A Gauge can go down again; a Counter can only increase.
inference_active = Gauge(
    'ai_inference_active',
    'Number of active inference requests',
    ['model_name']
)

@app.route('/metrics')
def metrics():
    # Serve every registered metric in the Prometheus text format
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)

@app.route('/infer', methods=['POST'])
def infer():
    model_name = 'bert-base-uncased'
    inference_active.labels(model_name=model_name).inc()
    # Simulated inference logic
    try:
        # Measure the duration
        start_time = time.time()
        # ... inference code ...
        duration = time.time() - start_time

        inference_total.labels(model_name=model_name, status='success').inc()
        inference_latency.labels(model_name=model_name).observe(duration)
        return {'result': 'success'}
    except Exception as e:
        inference_total.labels(model_name=model_name, status='error').inc()
        return {'error': str(e)}, 500
    finally:
        # The request is no longer in flight, whether it succeeded or failed
        inference_active.labels(model_name=model_name).dec()

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Result: when the application runs, the /metrics endpoint returns text such as # HELP ai_inference_total ... that Prometheus can parse.
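
To make the text format concrete, here is a deliberately small parser sketch. It handles only plain sample lines with optional labels and skips comment lines; it is not the parser Prometheus uses, and the sample text is hand-written to resemble what the counter above would produce.

```python
import re

# metric_name{label="value",...} value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_PAIR = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


SAMPLE_TEXT = """\
# HELP ai_inference_total Total number of AI inference requests
# TYPE ai_inference_total counter
ai_inference_total{model_name="bert-base-uncased",status="success"} 3.0
ai_inference_total{model_name="bert-base-uncased",status="error"} 1.0
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels["status"], value)
```

Each label combination is its own time series, which is why the success and error counts appear as two separate lines.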

Making sure Prometheus can read the metrics

In Kubernetes, the Service must expose container port 8080 under the port name metrics so that it matches the ServiceMonitor from the previous section.

Verify by running curl inside the AI pod. Replace <pod-name> with the name of your pod; this assumes curl is available in the container image:

kubectl exec -n ai-platform <pod-name> -- curl -s http://localhost:8080/metrics | grep ai_inference

Expected result: the output contains ai_inference_total and ai_inference_duration_seconds lines with numeric values.

Integrating OPA with Prometheus to collect policy state

OPA Gatekeeper (or a standalone OPA agent) can expose metrics about the state of policies and their violations. This tells us, as numbers, which rules the system is currently violating.

Assume you installed OPA Gatekeeper in Part 2. We need to make sure its metrics are exposed and create a ServiceMonitor so Prometheus can read them.

Configuring OPA to expose metrics

On a Kubernetes cluster, OPA Gatekeeper already serves a /metrics endpoint in the gatekeeper-system namespace (the default), on port 8888 unless it is overridden with the --prometheus-port flag. A ServiceMonitor selects Services, however, so check that a Service in that namespace exposes this port under a name and carries the labels used below (kubectl get svc -n gatekeeper-system --show-labels). If it does not, add such a Service or adjust the selector and port in the ServiceMonitor to match your installation.

Here we create a dedicated ServiceMonitor for OPA Gatekeeper.

Save the file at the full path: /etc/prometheus-operator/opa-service-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opa-gatekeeper-monitor
  namespace: monitoring
  labels:
    app: opa-gatekeeper
spec:
  selector:
    matchLabels:
      app: gatekeeper-audit
  namespaceSelector:
    matchNames:
      - gatekeeper-system
  endpoints:
    - port: http-prometheus
      path: /metrics
      interval: 30s

Result: Prometheus starts collecting metrics from OPA Gatekeeper, including gatekeeper_violations and gatekeeper_constraints.

Key OPA metrics explained

Gatekeeper publishes the following metrics, among others (names as documented for Gatekeeper 3.x; confirm them against the /metrics output of your version):

gatekeeper_violations: Gauge. Number of violations found by the most recent audit run, labelled by enforcement_action.
gatekeeper_constraints: Gauge. Number of constraints currently installed, labelled by enforcement_action and status.
gatekeeper_validation_request_count: Counter. Admission requests handled by the validating webhook, labelled by admission_status (allow or deny).

Verify the result: open the Prometheus UI, go to the "Graph" tab, and enter the query:

gatekeeper_violations

Expected result: if any policy is being violated, you see a series with a value > 0.
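
The Graph tab sends the expression to the Prometheus HTTP API endpoint /api/v1/query, so the same check can run from a script. A minimal sketch, assuming Prometheus is reachable on localhost:9090 through a port-forward; the helper names are illustrative and the sample payload is hand-written in the shape of an instant-vector response.

```python
import json
import urllib.parse
import urllib.request


def sum_instant_vector(payload: dict) -> float:
    """Add up the values of all series in an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample's "value" is [unix_timestamp, "value-as-string"].
    return sum((float(s["value"][1]) for s in payload["data"]["result"]), 0.0)


def instant_query(expr: str, base_url: str = "http://localhost:9090") -> dict:
    """Run an instant query against a live Prometheus (needs network access)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Offline example: hand-written payload, not real Prometheus output.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"enforcement_action": "deny"}, "value": [1700000000.0, "2"]},
            {"metric": {"enforcement_action": "warn"}, "value": [1700000000.0, "1"]},
        ],
    },
}
print(sum_instant_vector(sample_response))  # 3.0
```

Passing the query from the step above to instant_query and then to sum_instant_vector gives a single number that a CI job or alert script can compare against zero.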

Building a basic dashboard for AI policy compliance

Now that the data is in Prometheus, we will create a Grafana dashboard to visualize AI Governance.

Configuring the data source in Grafana

Open Grafana (IP:3000) and log in (admin/admin). Go to "Configuration" (gear icon) -> "Data sources".

Click "Add data source" and choose "Prometheus".

Fill in the details:

Name: Prometheus
URL: http://prometheus-operated.monitoring.svc.cluster.local:9090 (the Service the Operator creates for the instance; use prometheus-k8s instead if your installation defines a Service with that name)
Click "Save & Test" to confirm the connection.

Result: the message "Data source is working" appears.
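
The same data source can also be created through Grafana's HTTP API (POST /api/datasources), which additionally lets you pin the data source uid. The sketch below only builds the request; the function name, the example host names, and the choice of uid are assumptions for illustration, and the credentials are whatever you set in the Grafana Secret.

```python
import base64
import json
import urllib.request


def build_datasource_request(
    grafana_url: str,
    prometheus_url: str,
    user: str,
    password: str,
    uid: str = "prometheus",
) -> urllib.request.Request:
    """Build (but do not send) the request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",   # Grafana's backend talks to Prometheus
        "isDefault": True,
        "uid": uid,          # fixed uid so dashboards can reference it
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Hypothetical host names; substitute your Grafana address and the Prometheus URL
# you would otherwise type into the form.
request = build_datasource_request(
    "http://grafana.example:3000", "http://prometheus.example:9090", "admin", "admin"
)
print(request.get_method(), request.full_url)
```

Sending it is a single urllib.request.urlopen(request) call once the URLs point at real services.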

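Creating the AI Governance dashboard

Go to "Dashboards" -> "New" -> "Import". You can import JSON or build the panels by hand. Below is a sample JSON definition for the AI Governance dashboard with four panels: inference rate, P95 latency, error rate, and policy violations. Every panel references a data source whose uid is prometheus; a data source created through the UI usually gets a generated uid, so either select your data source in each panel after the import or change the uid values in the JSON.

Save the file at the full path: /etc/prometheus-operator/ai-governance-dashboard.json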

{
  "annotations": {
    "list": []
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 80 }
            ]
          },
          "unit": "short"
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "id": 1,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(ai_inference_total[5m]))",
          "legendFormat": "Total Inference/sec",
          "refId": "A"
        }
      ],
      "title": "Total Inference Rate",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 0.5 },
              { "color": "red", "value": 1.0 }
            ]
          },
          "unit": "s"
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "histogram_quantile(0.95, sum(rate(ai_inference_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P95 Latency",
          "refId": "A"
        }
      ],
      "title": "AI Latency (P95)",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 0.01 }
            ]
          },
          "unit": "percentunit"
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(ai_inference_total{status=\"error\"}[5m])) / sum(rate(ai_inference_total[5m]))",
          "legendFormat": "Error Rate",
          "refId": "A"
        }
      ],
      "title": "Inference Error Rate",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 1 }
            ]
          },
          "unit": "short"
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(gatekeeper_violations)",
          "legendFormat": "Policy Violations",
          "refId": "A"
        }
      ],
      "title": "Total Policy Violations",
      "type": "stat"
    }
  ],
  "refresh": "5s",
  "schemaVersion": 36,
  "style": "dark",
  "tags": ["ai", "governance"],
  "templating": { "list": [] },
  "time": { "from": "now-1h", "to": "now" },
  "timepicker": {},
  "timezone": "browser",
  "title": "AI Governance Dashboard",
  "uid": "ai-governance-01",
  "version": 1,
  "weekStart": ""
}

Result: after you import this file into Grafana, you have a dashboard that shows four key indicators and refreshes every 5 seconds.
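
The latency panel relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea for a single classic histogram; it is a simplified illustration rather than Prometheus's implementation, and the bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed input


# Same bucket bounds as the application's histogram; counts are made up.
example_buckets = [
    (0.05, 10), (0.1, 40), (0.25, 80), (0.5, 90), (1.0, 100),
    (2.5, 100), (5.0, 100), (10.0, 100), (math.inf, 100),
]
print(round(histogram_quantile(0.95, example_buckets), 3))  # 0.75
```

Because the estimate is interpolated, its accuracy depends on bucket width: with these bounds any P95 between 0.5 s and 1.0 s is only resolved to within that half-second bucket.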

Verify the final result:

Open the "AI Governance Dashboard" in Grafana.
Send a few requests to your AI service to generate traffic.
Watch the "Total Inference Rate" panel go up.
Create a resource that violates a policy (if you have one) so the "Total Policy Violations" panel jumps above 0.

Expected result: the figures on the dashboard change in near real time and reflect the activity and compliance state of the AI system.
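
Generating the test traffic can be scripted as well. This is a sketch that assumes the AI service is reachable at a base URL you supply (for example through kubectl port-forward); the JSON body is a placeholder, since the sample /infer handler ignores its input, and the function names are illustrative.

```python
import json
import urllib.error
import urllib.request


def build_infer_requests(base_url: str, count: int) -> list[urllib.request.Request]:
    """Prepare `count` POST requests for the /infer endpoint."""
    body = json.dumps({"text": "hello"}).encode()  # placeholder payload
    return [
        urllib.request.Request(
            f"{base_url}/infer",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for _ in range(count)
    ]


def send_all(requests: list[urllib.request.Request]) -> list[int]:
    """Send each request and collect the HTTP status codes."""
    codes = []
    for req in requests:
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                codes.append(resp.status)
        except urllib.error.HTTPError as err:
            codes.append(err.code)  # e.g. 500 from the service's error path
    return codes


prepared = build_infer_requests("http://localhost:8080", 5)
print(len(prepared), prepared[0].get_method(), prepared[0].full_url)
```

Calling send_all(prepared) against a running service should produce a list of status codes, and the inference counter on the dashboard should rise by the same number of requests.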

Series navigation:

Table of contents: Series: Building an automated AI Governance platform with OPA, Prometheus, and a Policy Engine

« Part 2: Deploying OPA Gatekeeper and writing the first policy for AI

Part 4: Integrating OPA and Prometheus to automate the Governance workflow »