Part 3: Deploying vLLM on Kubernetes with PagedAttention

Author: khang.nguyen92, 21/03/2026

Building a vLLM Docker Image with CUDA and FlashAttention

Preparing the build environment

We will not use a prebuilt image from Docker Hub, because those images are often not tuned for your specific GPU architecture. Building from source ensures that vLLM links against the CUDA Toolkit and FlashAttention versions installed in the cluster.

Log in to a GPU node (avoid building on a node without a GPU) and create a working directory.

mkdir -p ~/vllm-build && cd ~/vllm-build

Clone the official vLLM repository. The stable line at the time of writing is 0.6.x; check out a specific release tag so that the build is reproducible and includes full FlashAttention 2 support.

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.6.3.post1

Verify that Docker and the NVIDIA Container Toolkit are ready on the host for building and running GPU images.

docker info | grep -i nvidia

Expected result: a line such as "Runtime: nvidia" or "Runtimes: nvidia runc".
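The same check can be scripted, for example as a pre-build gate in CI. The sketch below is only an illustration: the function name `nvidia_runtime_listed` and the sample text are assumptions, and it relies on `docker info` printing the runtimes as a space-separated list after a `Runtimes:` or `Default Runtime:` key.

```python
def nvidia_runtime_listed(docker_info_text: str) -> bool:
    """Return True if a runtime line of `docker info` output mentions nvidia."""
    runtime_keys = ("runtimes", "default runtime", "runtime")
    for line in docker_info_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key.lower() in runtime_keys and "nvidia" in value.lower().split():
            return True
    return False


# Illustrative sample, not captured from a real host.
sample = " Runtimes: io.containerd.runc.v2 nvidia runc\n Default Runtime: runc\n"
print(nvidia_runtime_listed(sample))
```

In practice the text would come from `subprocess.run(["docker", "info"], capture_output=True, text=True).stdout`.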

Building an optimized Docker image

We follow the approach of the Dockerfile bundled with vLLM but write a custom one, setting the environment variables needed to enable FlashAttention 2 and to target H100/A100 GPUs.

Create a custom Dockerfile at /home/youruser/vllm-build/Dockerfile.vllm:

FROM nvcr.io/nvidia/cuda:12.3.1-cudnn8-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=$CUDA_HOME/bin:$PATH
ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

RUN apt-get update && apt-get install -y \
    python3.10 \
    python3.10-venv \
    python3.10-dev \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

RUN python3.10 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

RUN pip install --upgrade pip setuptools wheel

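# Install PyTorch built for CUDA 12.1 first; vLLM v0.6.3.post1 pins torch 2.4.0
RUN pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# Copy the vLLM source (the build context is ~/vllm-build, so the clone lives in ./vllm)
COPY vllm /vllm

# No GPU is visible during `docker build`, so name the target architectures:
# 8.0 = A100, 9.0 = H100
ENV TORCH_CUDA_ARCH_LIST="8.0 9.0"

# Build vLLM against the installed PyTorch.
# --no-build-isolation needs the build tools from requirements-build.txt in the venv.
RUN cd /vllm && \
    pip install -r requirements-build.txt && \
    pip install --no-build-isolation -e .

# Prefer the FlashAttention backend at runtime
ENV VLLM_ATTENTION_BACKEND=FLASH_ATTN

EXPOSE 8000

# ENTRYPOINT fixes the server module, so `docker run <image> --model ...` only replaces the arguments
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "mistralai/Mistral-7B-v0.1"]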

Run the image build. It can take 15-30 minutes or longer, depending on CPU and network speed.

cd ~/vllm-build
docker build -t vllm-inference:latest -f Dockerfile.vllm .

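Expected result: the final lines report success, for example "Successfully built [IMAGE_ID]" with the classic builder, or a "naming to docker.io/library/vllm-inference:latest" line with BuildKit.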

Verifying the image before pushing

Run the image directly on the host to confirm that CUDA and FlashAttention work before moving it to K8s.

docker run --gpus all --rm -p 8000:8000 vllm-inference:latest \
  --model mistralai/Mistral-7B-v0.1 \
  --served-model-name mistral-7b \
  --tensor-parallel-size 1

In another terminal, send a test API request:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "prompt": "Hello, world!", "max_tokens": 10}'

Expected result: a JSON response containing "choices" with the model's completion. If you see CUDA or OOM errors, stop and debug at this step.
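The curl check can also be written as a small Python smoke test using only the standard library. This is a sketch under the same assumptions as the curl call (server on localhost:8000, served model name "mistral-7b"); the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=10):
    """Build the POST request for the OpenAI-style /v1/completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_json):
    """Return the text of the first choice, or fail loudly if there is none."""
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {response_json}")
    return choices[0]["text"]


def smoke_test(base_url="http://localhost:8000", model="mistral-7b"):
    """Send one request to a running server and return the generated text."""
    request = build_completion_request(base_url, model, "Hello, world!")
    with urllib.request.urlopen(request, timeout=60) as response:
        return first_completion_text(json.load(response))
```

Calling `smoke_test()` while the container is running should return a short string; an HTTP error or a `ValueError` points to the same problems the curl output would reveal.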

Deploying vLLM on Kubernetes

Configuring the namespace and GPU node selector

Create a dedicated namespace to isolate the AI workload. Configure a node selector so that Pods are scheduled only onto GPU nodes (based on the label created in Part 2).

cat

Check the GPU label on your nodes (commonly nvidia.com/gpu or cloud.google.com/gke-accelerator, depending on the setup). The rest of this part assumes the label is nvidia.com/gpu: "true".
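A nodeSelector is a plain equality match: a node is eligible only if every key in the selector is present on the node with exactly the same value. The sketch below models that rule; the node names and label sets are made up for illustration.

```python
def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """True if the node carries every selector key with an identical value."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical cluster: one GPU node labelled as assumed in the text, one CPU node.
nodes = {
    "gpu-node-1": {"kubernetes.io/hostname": "gpu-node-1", "nvidia.com/gpu": "true"},
    "cpu-node-1": {"kubernetes.io/hostname": "cpu-node-1"},
}
selector = {"nvidia.com/gpu": "true"}

eligible = [name for name, labels in nodes.items() if matches_node_selector(labels, selector)]
print(eligible)
```

Label values are strings, so a node labelled with a different value than "true" would not match this selector.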

Writing the Kubernetes manifest for the vLLM server

Create the complete YAML manifest at /home/youruser/vllm-deploy/vllm-deployment.yaml. It contains a Service and a Deployment.

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm-inference
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
  selector:
    app: vllm-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: vllm-inference
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      # Run only on GPU nodes
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
        - name: vllm-server
          image: vllm-inference:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
          env:
            - name: VLLM_MODEL
              value: "mistralai/Mistral-7B-v0.1"
            - name: VLLM_TENSOR_PARALLEL_SIZE
              value: "1"
            - name: VLLM_MAX_NUM_BATCHES
              value: "256"
            - name: VLLM_GPU_MEMORY_UTILIZATION
              value: "0.95"
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - $(VLLM_MODEL)
            - --served-model-name
            - mistral-7b
            - --tensor-parallel-size
            - $(VLLM_TENSOR_PARALLEL_SIZE)
            - --max-num-seqs
            - $(VLLM_MAX_NUM_BATCHES)
            - --gpu-memory-utilization
            - $(VLLM_GPU_MEMORY_UTILIZATION)
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "16Gi"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
      volumes:
        - name: model-cache
          emptyDir:
            medium: "Memory"
            sizeLimit: "50Gi"

Deploy the manifest to the cluster:

kubectl apply -f /home/youruser/vllm-deploy/vllm-deployment.yaml

Check the Pod status:

kubectl get pods -n vllm-inference

Check the logs to confirm that vLLM has loaded the model and started the endpoint:

kubectl logs -n vllm-inference -l app=vllm-server -f

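Expected result: the Pod reaches the Running state, and the logs show the model being loaded followed by the server start-up line, for example "Uvicorn running on http://0.0.0.0:8000" (the exact wording depends on the vLLM version).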

Optimizing resources and PagedAttention

Configuring GPU memory utilization

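The --gpu-memory-utilization flag sets the fraction of each GPU's VRAM that this vLLM instance may use for model weights, activations, and the PagedAttention KV cache; whatever remains after the weights are loaded becomes KV cache. The default is 0.9. On K8s, tune it carefully to avoid OOM (Out of Memory) errors, because the driver and any other process on the GPU also need memory.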

The manifest above sets 0.95. If you run several models or a larger model, lower it to 0.9 to be safe.
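A rough way to see what the setting buys is to estimate the KV cache budget by hand. The sketch below assumes an 80 GiB GPU, fp16 weights and KV cache, about 7.24 billion parameters for Mistral-7B-v0.1, and the model's published shape (32 layers, 8 KV heads, head dimension 128). It ignores activation memory and CUDA overhead, so real numbers will be lower.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_budget_bytes(total_vram_bytes, gpu_memory_utilization, weight_bytes,
                          overhead_bytes=0):
    """Memory left for KV cache after weights (and optional overhead) are placed."""
    usable = int(total_vram_bytes * gpu_memory_utilization)
    return max(0, usable - weight_bytes - overhead_bytes)


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
weights = int(7.24e9 * 2)  # assumption: ~7.24B parameters stored in fp16

for util in (0.90, 0.95):
    budget = kv_cache_budget_bytes(80 * GIB, util, weights)
    print(f"util={util}: ~{budget / GIB:.1f} GiB KV cache, ~{budget // per_token} tokens")
```

The gap between the two settings is 5% of VRAM, all of which lands in the KV cache, which is why a small change in the flag has a visible effect on how many tokens can be cached.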

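To raise throughput under concurrent load, increase --max-num-seqs, the maximum number of sequences processed in one iteration; the VLLM_MAX_NUM_BATCHES variable in the manifest holds this value. The default in vLLM 0.6.x is 256. A higher value admits more concurrent requests, but it can raise per-request latency and puts more pressure on the KV cache.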
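Before raising the limit, it helps to check how many sequences the KV cache can actually hold. The following sketch is a back-of-the-envelope helper, not vLLM's scheduler logic; the token budget and sequence length are assumed example values.

```python
def max_concurrent_sequences(kv_budget_tokens: int, tokens_per_sequence: int) -> int:
    """How many sequences of a given total length fit in the KV cache at once."""
    if tokens_per_sequence <= 0:
        raise ValueError("tokens_per_sequence must be positive")
    return kv_budget_tokens // tokens_per_sequence


def effective_max_num_seqs(configured: int, kv_budget_tokens: int,
                           tokens_per_sequence: int) -> int:
    """The configured limit only helps up to what the KV cache can hold."""
    return min(configured, max_concurrent_sequences(kv_budget_tokens, tokens_per_sequence))


# Assumed example: room for 400,000 cached tokens, 2,048 tokens per request.
print(effective_max_num_seqs(configured=512, kv_budget_tokens=400_000,
                             tokens_per_sequence=2_048))
```

If the result is below the configured value, raising the limit further mostly leads to preemption rather than extra throughput.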

kubectl set env deployment/vllm-deployment -n vllm-inference VLLM_MAX_NUM_BATCHES=512

The Deployment rolls out a new Pod automatically to apply the new configuration.
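One caveat with a single replica on a single GPU: a Deployment defaults to a RollingUpdate strategy with maxSurge and maxUnavailable both at 25%, where surge is rounded up and unavailable is rounded down. The sketch below reproduces that rounding to show what it means for one replica.

```python
import math


def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Absolute Pod counts Kubernetes derives from percentage-based settings."""
    max_surge = math.ceil(replicas * max_surge_pct / 100)               # rounded up
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)  # rounded down
    return max_surge, max_unavailable


surge, unavailable = rolling_update_bounds(replicas=1)
print(f"maxSurge={surge}, maxUnavailable={unavailable}")
```

With one extra Pod allowed and zero Pods allowed to be unavailable, the new Pod is created before the old one is removed. If the old Pod holds the only GPU, the new one may stay Pending; setting `strategy.type: Recreate` on the Deployment is one way around that, at the cost of a short outage during each rollout.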

PagedAttention and swap space

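PagedAttention automatically splits the KV cache into fixed-size pages (blocks) and is always active in vLLM. Under memory pressure vLLM can preempt sequences and swap their KV blocks out to CPU memory; the --swap-space flag (in GiB per GPU) controls how much host RAM is reserved for this.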
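The paging idea can be illustrated with a toy allocator: each sequence keeps a table of physical block ids and receives a new block only when its current blocks are full. This is a simplified teaching sketch, not vLLM's implementation; the class name is hypothetical, and the block size of 16 tokens matches vLLM's default for GPUs.

```python
class BlockAllocator:
    """Toy paged KV cache: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}               # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_tokens(self, seq_id, n):
        """Reserve room for n more tokens of a sequence, allocating blocks lazily."""
        table = self.block_tables.setdefault(seq_id, [])
        new_length = self.lengths.get(seq_id, 0) + n
        blocks_required = -(-new_length // self.block_size)  # ceiling division
        needed = blocks_required - len(table)
        if needed > len(self.free):
            raise MemoryError("out of KV blocks; a real engine would preempt or swap")
        for _ in range(needed):
            table.append(self.free.pop())
        self.lengths[seq_id] = new_length

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8)
alloc.append_tokens("a", 20)  # 20 tokens need two 16-token blocks
alloc.append_tokens("b", 5)   # a second sequence gets its own block
alloc.append_tokens("a", 12)  # 32 tokens still fit in the two existing blocks
print(alloc.block_tables, "free blocks:", len(alloc.free))
```

Because blocks need not be contiguous, memory wasted per sequence is limited to the unused tail of its last block, which is what lets vLLM pack many sequences into the same VRAM.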

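To give a model larger than 20B parameters more swap space, pass --swap-space in the container command. The manifest above does not include that flag, and vLLM does not read VLLM_SWAP_SPACE on its own, so first append the two arguments --swap-space and $(VLLM_SWAP_SPACE) to the command list, then set the variable: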

kubectl set env deployment/vllm-deployment -n vllm-inference VLLM_SWAP_SPACE=16

This lets vLLM use system RAM as a buffer for KV cache blocks when VRAM is full, which reduces OOM errors at the cost of somewhat higher latency. Make sure the Pod's memory allocation covers the additional 16 GiB. This is a trade-off to weigh.
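To get a feel for what 16 GiB of swap space holds, the sketch below converts it into KV blocks. It assumes 16-token blocks and 131,072 bytes of KV cache per token, which is the fp16 figure for Mistral-7B-v0.1 (32 layers, 8 KV heads, head dimension 128); a model above 20B parameters would have a larger per-token footprint and therefore fewer blocks.

```python
GIB = 1024 ** 3


def cpu_swap_blocks(swap_space_gib, block_size_tokens, kv_bytes_per_token):
    """Number of whole KV blocks that fit in the reserved CPU swap space."""
    block_bytes = block_size_tokens * kv_bytes_per_token
    return (swap_space_gib * GIB) // block_bytes


blocks = cpu_swap_blocks(swap_space_gib=16, block_size_tokens=16,
                         kv_bytes_per_token=131072)
print(blocks, "blocks,", blocks * 16, "tokens")
```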

Verifying PagedAttention performance

Use metrics to see how vLLM is using VRAM. The vLLM server exposes a Prometheus metrics endpoint at /metrics.

POD_NAME=$(kubectl get pods -n vllm-inference -l app=vllm-server -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n vllm-inference $POD_NAME 8000:8000 &
sleep 2
curl http://localhost:8000/metrics

In the metrics output, look for vllm:gpu_cache_usage_perc. It reports how much of the PagedAttention KV cache is in use, as a fraction between 0 and 1 (1 means 100%). If it stays below 0.5 under load, you can raise the batch size; if it exceeds 0.95, reduce the batch size or the model size.
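The check can be automated by parsing the Prometheus text output. The sketch below applies the two thresholds mentioned in the text; the sample text is illustrative rather than captured from a real server, and the function names are hypothetical.

```python
import re

# Matches the gauge line with or without a {label="..."} section.
_GAUGE = re.compile(r"^vllm:gpu_cache_usage_perc(?:\{[^}]*\})?\s+([0-9.eE+-]+)\s*$")


def gpu_cache_usage(metrics_text):
    """Return the KV cache usage as a fraction (0..1), or None if absent."""
    for line in metrics_text.splitlines():
        match = _GAUGE.match(line)
        if match:
            return float(match.group(1))
    return None


def advise(usage):
    """Apply the thresholds from the text to a usage fraction."""
    if usage is None:
        return "metric not found"
    if usage < 0.50:
        return "headroom: batch size can likely grow"
    if usage > 0.95:
        return "near capacity: reduce batch size or model size"
    return "ok"


# Illustrative sample in Prometheus exposition format.
sample = """# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="mistral-7b"} 0.37
"""
print(advise(gpu_cache_usage(sample)))
```

A single reading taken while the server is idle says little; sampling the value repeatedly during a load test gives a more useful picture.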

Integrating the OpenAI-compatible API endpoint

Configuring model name mapping

The vLLM server exposes an OpenAI-compatible API. For clients (such as LangChain or Hugging Face inference clients) to work, the model name in the request must match the configured --served-model-name.

The manifest above sets --served-model-name mistral-7b. This lets you call the API with the model name "mistral-7b" regardless of the actual name of the model files.
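A client can confirm the name before sending work by reading the model list that OpenAI-compatible servers return from GET /v1/models. The sketch below validates a requested name against such a response; the sample payload is a trimmed illustration and the helper names are hypothetical.

```python
def served_model_ids(models_response: dict) -> list:
    """Extract the model ids from a /v1/models style response."""
    return [model["id"] for model in models_response.get("data", [])]


def check_model_name(models_response: dict, requested: str) -> str:
    """Return the requested name if it is served, otherwise raise with the alternatives."""
    ids = served_model_ids(models_response)
    if requested not in ids:
        raise ValueError(f"model {requested!r} is not served; available: {ids}")
    return requested


# Trimmed, illustrative response body.
sample_response = {"object": "list", "data": [{"id": "mistral-7b", "object": "model"}]}
print(check_model_name(sample_response, "mistral-7b"))
```

Requesting the Hugging Face repository name instead of the served name would fail this check, which mirrors the error a client would otherwise only see at request time.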

Testing with the OpenAI client

Install the openai Python library in a client container or on your laptop for testing.

pip install openai

Create the Python test script test_inference.py:

from openai import OpenAI

# Point the client at the vLLM service
client = OpenAI(
    base_url="http://vllm-service.vllm-inference.svc.cluster.local:8000/v1",
    api_key="vllm-local-key",  # Not needed for in-cluster calls, but the client requires a value
)

response = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Giải thích cơ chế PagedAttention trong vLLM."},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)

Run the script from inside the cluster (or expose the service through a LoadBalancer or Ingress if you test from outside).

python test_inference.py

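Expected result: a printed explanation of PagedAttention in Vietnamese (if the model supports it) or in English. Note that Mistral-7B-v0.1 is a base model and its tokenizer may not define a chat template, in which case the chat endpoint returns an error; use an instruct variant or pass --chat-template to the server.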

Testing inference with Llama 3 and measuring latency

Deploying Llama 3 8B

Switch the model to Llama 3 to check compatibility and performance. Llama 3 is a gated model on Hugging Face, so downloading it requires approved access and a token.

First, create a Secret holding HF_TOKEN:

kubectl create secret generic hf-secret -n vllm-inference \
  --from-literal=HF_TOKEN=YOUR_HUGGINGFACE_TOKEN_HERE
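When the Secret is created with --from-literal, kubectl stores the value base64-encoded under the Secret's `data` field. If you write the Secret as YAML instead, the value must be encoded the same way (or placed in plain text under `stringData`). The sketch below shows the encoding; the helper names are hypothetical, and base64 is an encoding, not encryption.

```python
import base64


def secret_data_value(plaintext: str) -> str:
    """Encode a value the way it appears under `data` in a Secret manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding, e.g. to inspect `kubectl get secret -o yaml` output."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = secret_data_value("YOUR_HUGGINGFACE_TOKEN_HERE")  # placeholder, not a real token
print(encoded)
print(decode_secret_value(encoded))
```

Anyone with read access to the Secret can decode the token, so access to the namespace should be restricted accordingly.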

Update the Deployment to mount the Secret and switch the model:

cat

Wait for the Pod to restart and load the new model.

Measuring latency and throughput

Use vLLM's built-in benchmark tool or curl to measure latency (TTFT, Time To First Token, and TPOT, Time Per Output Token).
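Both metrics can be derived from the arrival times of streamed tokens: TTFT is the delay until the first token, and TPOT is the average gap between the following tokens. The sketch below uses that common definition with synthetic timestamps; the function name is hypothetical, and recording real timestamps would require a streaming request.

```python
def ttft_and_tpot(request_start: float, token_times: list) -> tuple:
    """Compute (TTFT, TPOT) in seconds from per-token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0  # a single token has no inter-token gap
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Synthetic example: request sent at t=0, tokens arrive at 0.25 s, 0.5 s and 0.75 s.
ttft, tpot = ttft_and_tpot(0.0, [0.25, 0.5, 0.75])
print(f"TTFT={ttft * 1000:.0f} ms, TPOT={tpot * 1000:.0f} ms")
```

Separating the two matters because TTFT is dominated by prompt processing and queueing, while TPOT reflects decode speed.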

Create a simple benchmark script:

#!/bin/bash
# Measures end-to-end latency and output throughput for one non-streaming request.
set -euo pipefail

MODEL="llama-3-8b"
PROMPT="Explain quantum entanglement in simple terms."

echo "Testing Latency for $MODEL..."
START=$(date +%s%N)   # nanoseconds; requires GNU date
curl -s http://vllm-service.vllm-inference.svc.cluster.local:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"$PROMPT\",
    \"max_tokens\": 100,
    \"stream\": false
  }" > /tmp/response.json
END=$(date +%s%N)

DURATION=$(( (END - START) / 1000000 ))
echo "Total Time: ${DURATION}ms"

# Parse the response to get the number of generated tokens
TOKENS=$(python3 -c "import sys, json; print(json.load(sys.stdin)['usage']['completion_tokens'])" < /tmp/response.json)
echo "Generated Tokens: $TOKENS"
if [ "$TOKENS" -gt 0 ] && [ "$DURATION" -gt 0 ]; then
  TPS=$(echo "scale=2; $TOKENS * 1000 / $DURATION" | bc)
  echo "Tokens Per Second: $TPS"
fi

Run the script from a client pod or your laptop:

chmod +x benchmark.sh && ./benchmark.sh

Expected results:

TTFT (time to first token) under 500 ms for Llama 3 8B on a single A100/H100 GPU. The script above reports only the total request time; measuring TTFT requires a streaming request.
Throughput above 50 tokens/s at batch size 1.
No OOM errors or timeouts.
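A single run is noisy, so targets like these are easier to judge over several runs. The sketch below summarizes repeated measurements with a nearest-rank percentile and converts a token count and duration into tokens per second, as the benchmark script does. The sample durations are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def tokens_per_second(tokens: int, duration_ms: float) -> float:
    """Output throughput for one request."""
    return tokens * 1000 / duration_ms


# Hypothetical total request times in milliseconds from five runs of 100 tokens each.
durations_ms = [1850, 1790, 2100, 1820, 1905]
print("p50:", percentile(durations_ms, 50), "ms")
print("p95:", percentile(durations_ms, 95), "ms")
print("throughput at p50:", round(tokens_per_second(100, percentile(durations_ms, 50)), 1), "tokens/s")
```

Reporting p95 next to the median shows whether occasional slow requests, for example during preemption, are hiding behind a good average.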

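Cross-check the results against the vLLM logs, which periodically print aggregate throughput statistics:

kubectl logs -n vllm-inference -l app=vllm-server | grep -i "throughput"

The log lines report average prompt and generation throughput together with KV cache usage. Per-request latency is exposed on the /metrics endpoint as the histograms vllm:time_to_first_token_seconds and vllm:time_per_output_token_seconds.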
