Part 9: Performance optimization and system monitoring

Author: tranmai93 (21/03/2026)

Deploying Prometheus and Grafana on Kubernetes

Installing the Prometheus Operator and the monitoring stack

Use the Prometheus Community Helm chart to quickly install Prometheus, Grafana, Alertmanager and Node Exporter. It is the de facto standard for monitoring Kubernetes.

Add the repository and install the chart with its default values. The defaults are a reasonable starting point; review storage and retention settings before relying on them in production.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

Check the pod status after installation and make sure every component is in the Running state.

kubectl get pods -n monitoring

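Configuring a ServiceMonitor for the AI Pod

For Prometheus to automatically discover and scrape metrics from the Pods running the AI application (including WASM), create a ServiceMonitor resource. A ServiceMonitor selects Services by label, so the Service in front of the Pods must carry the label `app: wasm-ai-app` and expose a port named `http`. The Pods themselves must serve a `/metrics` endpoint (usually exposed by the Prometheus client library for C++/NodeJS).

Create the ServiceMonitor configuration file at the full path `/etc/monitoring/service-monitor.yaml`.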

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: wasm-ai-service-monitor
  namespace: monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: wasm-ai-app
  endpoints:
  - port: http
    interval: 10s
    path: /metrics
  namespaceSelector:
    matchNames:
    - default

Apply the configuration to the cluster.

kubectl apply -f /etc/monitoring/service-monitor.yaml

Verify by opening the Prometheus UI and selecting "Status" -> "Targets". The wasm-ai-app target should show the state "UP".
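The same check can be scripted against the JSON that Prometheus returns from its `/api/v1/targets` endpoint. The sketch below only inspects a payload that has already been loaded into a dict; fetching it is left out. The `job` value `wasm-ai-service` is an assumption (a ServiceMonitor target normally takes the Service name as its job label), and the sample payload is hypothetical and trimmed to the fields used here.

```python
def targets_for_job(payload: dict, job: str) -> list[dict]:
    """Return the active targets whose `job` label equals `job`."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [t for t in active if t.get("labels", {}).get("job") == job]


def all_up(payload: dict, job: str) -> bool:
    """True only if the job has at least one target and every target is healthy."""
    targets = targets_for_job(payload, job)
    return bool(targets) and all(t.get("health") == "up" for t in targets)


# Hypothetical response body, reduced to the fields read above.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "wasm-ai-service"},
             "scrapeUrl": "http://10.0.0.5:8080/metrics", "health": "up"},
            {"labels": {"job": "wasm-ai-service"},
             "scrapeUrl": "http://10.0.0.6:8080/metrics", "health": "down"},
            {"labels": {"job": "node-exporter"},
             "scrapeUrl": "http://10.0.0.7:9100/metrics", "health": "up"},
        ]
    },
}

if __name__ == "__main__":
    print("wasm-ai-service all up:", all_up(sample, "wasm-ai-service"))
```

Returning False when no target exists is deliberate: a ServiceMonitor whose selector matches nothing produces no targets at all, which is the most common misconfiguration.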

Monitoring WASM inference latency

Defining a latency Histogram in C++/WASM

To measure performance accurately, a plain `timer` is not enough. Use a Histogram so that the latency distribution (P50, P90, P99) can be analyzed. In your C++ engine code, integrate the `prometheus-cpp` library and expose this metric.

The following C++ snippet is added to the inference function to record processing time (assuming you have rebuilt the WASM module with metrics support).

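#include <chrono>
#include <memory>
#include <prometheus/exposer.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

// Registry that the Exposer serves on /metrics (Exposer setup not shown here).
auto registry = std::make_shared<prometheus::Registry>();

// Create the histogram family, then one histogram with explicit bucket bounds in seconds.
auto& latency_family = prometheus::BuildHistogram()
    .Name("wasm_inference_latency_seconds")
    .Help("Histogram of inference latency in seconds")
    .Register(*registry);
auto& latency_histogram = latency_family.Add(
    {}, prometheus::Histogram::BucketBoundaries{
            0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0});

// Inside the inference function:
auto start = std::chrono::steady_clock::now();
// ... run the WASM inference logic ...
auto end = std::chrono::steady_clock::now();
// duration<double> defaults to seconds, which matches the metric's unit.
double seconds = std::chrono::duration<double>(end - start).count();
latency_histogram.Observe(seconds);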

Rebuild the container with the new code and deploy it to K8s. Prometheus will scrape this histogram automatically through the ServiceMonitor configured in the previous section.
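To see why the bucket boundaries matter, it helps to know how a quantile is estimated from a histogram. The sketch below mirrors the approach used by PromQL's `histogram_quantile` for classic histograms: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's actual implementation, and the sample bucket counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # the best available answer is the previous finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for four of the buckets.
sample_buckets = [(0.1, 10), (0.5, 60), (1.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    print("P50:", histogram_quantile(0.50, sample_buckets))
    print("P99:", histogram_quantile(0.99, sample_buckets))
```

Because the estimate is interpolated, its error is bounded by the width of the bucket it lands in. If most inferences finish between 50 ms and 100 ms, adding boundaries inside that range gives a more precise P99 than the coarse defaults.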

Creating a Grafana dashboard for latency

Create a dashboard JSON file that visualizes the P99 and P50 latency together with request throughput. Save it to `/etc/grafana/dashboards/wasm-latency.json`.

{
  "dashboard": {
    "title": "WASM AI Inference Performance",
    "panels": [
      {
        "id": 1,
        "title": "Inference Latency (P99)",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(wasm_inference_latency_seconds_bucket[5m])) by (le))",
            "refId": "A"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "type": "graph"
      },
      {
        "id": 2,
        "title": "Inference Latency (P50)",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(wasm_inference_latency_seconds_bucket[5m])) by (le))",
            "refId": "A"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "type": "graph"
      },
      {
        "id": 3,
        "title": "Throughput (Requests per second)",
        "targets": [
          {
            "expr": "sum(rate(wasm_inference_latency_seconds_count[5m]))",
            "refId": "A"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
        "type": "graph"
      }
    ]
  }
}

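Copy the dashboard file into the Grafana container. By default the chart runs Grafana as a Deployment, so the pod name carries a generated suffix: look it up with `kubectl get pods -n monitoring` and substitute it for `<grafana-pod>`.

kubectl cp -n monitoring -c grafana /etc/grafana/dashboards/wasm-latency.json <grafana-pod>:/var/lib/grafana/dashboards/wasm-latency.json

Verify by opening Grafana (via port-forward), going to Dashboard -> Import and selecting the dashboard file. The charts should show real-time data when requests arrive.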
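Hand-editing panel JSON gets repetitive once more percentiles are needed. As a minimal sketch, the panels can be generated from a list of quantiles; here a P90 panel is added next to P50 and P99. The function names and the three-column layout are illustrative choices, while the metric name, panel fields and wrapper structure follow the JSON above.

```python
import json

METRIC = "wasm_inference_latency_seconds"


def quantile_expr(q: float, window: str = "5m") -> str:
    """Build the PromQL expression for one latency quantile."""
    return f"histogram_quantile({q}, sum(rate({METRIC}_bucket[{window}])) by (le))"


def latency_panel(panel_id: int, q: float, x: int, y: int) -> dict:
    """One panel, 8 grid units wide so that three fit in Grafana's 24-unit row."""
    return {
        "id": panel_id,
        "title": f"Inference Latency (P{round(q * 100)})",
        "targets": [{"expr": quantile_expr(q), "refId": "A"}],
        "gridPos": {"h": 8, "w": 8, "x": x, "y": y},
        "type": "graph",
    }


def build_dashboard(quantiles=(0.5, 0.9, 0.99)) -> dict:
    """Assemble the dashboard document with one latency panel per quantile."""
    panels = [latency_panel(i + 1, q, x=8 * i, y=0) for i, q in enumerate(quantiles)]
    return {"dashboard": {"title": "WASM AI Inference Performance", "panels": panels}}


if __name__ == "__main__":
    print(json.dumps(build_dashboard(), indent=2))
```

Redirecting the script's output to `wasm-latency.json` yields a file with the same shape as the hand-written one, with the throughput panel left out for brevity.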

Analyzing network performance between client and server

Using cURL to test latency from the client

Use a simple Pod containing cURL to simulate a client and measure the network round-trip time (RTT) plus the server processing time. This is a way to check end-to-end latency.

Run a temporary container to perform the test.

kubectl run -it --rm network-test --image=curlimages/curl --restart=Never -- /bin/sh

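Inside the container, run the test with `time` and curl's `-w` timing variables to break the request down.

time curl -w "\nTransfer Time: %{time_total}s\nConnect: %{time_connect}s\nStart Transfer: %{time_starttransfer}s\n" -o /dev/null -s http://wasm-ai-service.default.svc.cluster.local:8080/infer

Reading the results: `time_total` is the total request time, `time_connect` is the time until the TCP connection is established, and `time_starttransfer` is the time from the start of the request until the first response byte arrives. All three are measured from the start of the request, so `time_starttransfer` includes connection setup as well as server processing; the difference `time_starttransfer - time_connect` approximates server processing time plus one network round trip.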
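Because curl reports these timers cumulatively, the per-phase durations have to be derived by subtraction. A small sketch of that arithmetic follows; the phase names and the sample values are illustrative, not curl output.

```python
def timing_breakdown(time_connect: float, time_starttransfer: float,
                     time_total: float) -> dict[str, float]:
    """Split curl's cumulative timers (seconds) into consecutive phases."""
    if not 0 <= time_connect <= time_starttransfer <= time_total:
        raise ValueError("timers must be non-negative and non-decreasing")
    return {
        # DNS lookup plus TCP handshake
        "connect": time_connect,
        # request sent, server processing, first byte back
        "wait_first_byte": time_starttransfer - time_connect,
        # remainder of the response body
        "download": time_total - time_starttransfer,
    }


if __name__ == "__main__":
    # Hypothetical values read off a curl run.
    for phase, seconds in timing_breakdown(0.002, 0.120, 0.125).items():
        print(f"{phase}: {seconds * 1000:.1f} ms")
```

If `wait_first_byte` dominates, the time is being spent in the inference path rather than on the network, and the histogram from the previous section is the better tool for digging further.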

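Configuring network metrics in K8s (Network Policy & Metrics)

For long-term monitoring, you need metrics on the number of requests that fail because of timeouts or connection errors. `kube-state-metrics` (usually already included in the stack) reports Kubernetes object state such as restarts and readiness; request-level errors and latency have to come from custom application metrics.

Define an alert rule that warns when the network is congested (abnormally high latency). The expression assumes the service exports an `http_request_duration_seconds` histogram; substitute the metric your application actually exposes.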

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: wasm-network-alerts
  namespace: monitoring
  labels:
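    app: wasm-ai
    release: monitoring  # with default chart values, Prometheus only loads rules labelled with the Helm release name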
spec:
  groups:
  - name: wasm-network
    rules:
    - alert: HighNetworkLatency
      expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High network latency detected on WASM AI service"
        description: "P95 latency is above 500ms for more than 5 minutes."

Save the rule as `/etc/monitoring/prometheus-rules.yaml` and apply it to Prometheus.

kubectl apply -f /etc/monitoring/prometheus-rules.yaml

Verify in the Prometheus UI under Alerts. The "HighNetworkLatency" rule should appear in the list. You can run a load test to trigger the alert.
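When load-testing the rule, keep in mind that `for: 5m` delays firing: the alert goes to pending when the expression first becomes true and only fires once it has stayed true for the whole period. The sketch below simulates that state machine over a series of evaluations. The 60-second evaluation interval and the sample P95 values are assumptions for illustration; the real interval depends on your Prometheus configuration.

```python
def alert_states(values: list[float], threshold: float,
                 for_seconds: int, interval_seconds: int) -> list[str]:
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # index of the evaluation where the condition became true
    for i, value in enumerate(values):
        if value > threshold:
            if active_since is None:
                active_since = i
            held = (i - active_since) * interval_seconds
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical P95 latency (seconds) at one-minute evaluations.
    p95 = [0.2, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.3]
    print(alert_states(p95, threshold=0.5, for_seconds=300, interval_seconds=60))
```

A load test therefore has to keep P95 above 500 ms for longer than five minutes without a dip; a single evaluation below the threshold sends the alert back to inactive.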

Optimizing Pod configuration to reduce startup time

Reducing the Docker image size

A Pod's startup time depends heavily on pulling the image from the registry. Optimize the Dockerfile by using a small base image (alpine/slim) and removing packages that the WASM runtime does not need.

An optimized Dockerfile structure for the WASM inference engine.

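# Build stage: install runtime dependencies and build the WASM module
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag
RUN npm ci --omit=dev
COPY . .
RUN npm run build-wasm

# Runtime stage: copy only what the server needs at run time
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/build ./build
COPY --from=builder /app/*.wasm ./build/
# Assumes server.js sits in the project root; without this copy the CMD below has nothing to run
COPY --from=builder /app/server.js ./server.js
EXPOSE 8080
CMD ["node", "server.js"]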

Build and push the new image.

docker build -t your-registry/wasm-ai:optimized .
docker push your-registry/wasm-ai:optimized

Verify by comparing the size of the old and new images (`docker images`). The size should drop noticeably (for example, from 500MB to 150MB).

Configuring the readiness probe and startup probe

To keep a slow-starting Pod from being restarted repeatedly and ending up in "CrashLoopBackOff", use a `startupProbe`. Until the startup probe succeeds, K8s holds back the liveness and readiness probes, so the container gets more time to start before K8s kills it. The `readinessProbe` ensures traffic is routed to the Pod only after the WASM engine has finished loading the model.

Update the Deployment with the tuned probes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wasm-ai-deployment
spec:
  selector:
    matchLabels:
      app: wasm-ai-app
  template:
    metadata:
      labels:
        app: wasm-ai-app
    spec:
      containers:
      - name: wasm-ai
        image: your-registry/wasm-ai:optimized
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "500m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 30 # Allows up to 5 minutes for startup (30 * 10s)
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 2
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 20

Apply the new deployment.

kubectl apply -f deployment.yaml

Verify by deleting the pod to trigger a restart and watching its status.

kubectl delete pod -l app=wasm-ai-app
kubectl get pod -l app=wasm-ai-app -w

Watch the status change from `ContainerCreating` to `Running`. This should take less time, and the container should not be killed midway thanks to the `startupProbe`.
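The startup probe values are worth sanity-checking against the measured model load time. The sketch below uses the rule of thumb from the Kubernetes documentation (`failureThreshold * periodSeconds`) plus the initial delay. It is an approximation, since exact timing depends on how the kubelet schedules probes and on probe timeouts, and the load times passed in are hypothetical.

```python
def startup_budget_seconds(initial_delay: int, period: int,
                           failure_threshold: int) -> int:
    """Approximate time a container has to pass its startup probe."""
    return initial_delay + period * failure_threshold


def fits_budget(load_seconds: float, initial_delay: int = 5,
                period: int = 10, failure_threshold: int = 30) -> bool:
    """Check a measured startup time against the probe settings (defaults match the manifest)."""
    return load_seconds <= startup_budget_seconds(initial_delay, period, failure_threshold)


if __name__ == "__main__":
    print("budget:", startup_budget_seconds(5, 10, 30), "s")
    print("120 s model load fits:", fits_budget(120))
    print("400 s model load fits:", fits_budget(400))
```

If the measured load time comes close to the budget, raising `failureThreshold` is usually preferable to raising `periodSeconds`, because a shorter period lets the Pod be marked as started sooner once it is up.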

Series: Building a Real-time AI Platform with WebAssembly, TensorFlow Lite and Kubernetes