Part 4: Deep performance optimization with TensorRT-LLM and a custom Docker image

Author: khang.nguyen92, 21/03/2026

Building a custom TensorRT-LLM Docker image from source

To get maximum performance, we do not use the public image; we build directly from the TensorRT-LLM source code instead. This gives precise control over the CUDA version, the TensorRT library, and the compile flags tuned for a specific GPU architecture.
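
As a small illustration of what "targeting a specific GPU architecture" means in practice, the sketch below maps a few GPU models to their CUDA compute capabilities and joins them into a semicolon-separated architecture list. The helper names `arch_list` and `supports_fp8` and the small GPU table are illustrative assumptions, not part of TensorRT-LLM.

```python
# Compute capabilities published by NVIDIA for a few data-center GPUs.
COMPUTE_CAPABILITY = {
    "A100": "80",  # Ampere
    "A10": "86",   # Ampere
    "L40": "89",   # Ada Lovelace
    "H100": "90",  # Hopper
}


def arch_list(gpus):
    """Return a sorted, de-duplicated "80;90"-style architecture string."""
    codes = {COMPUTE_CAPABILITY[gpu] for gpu in gpus}
    return ";".join(sorted(codes, key=int))


def supports_fp8(gpu):
    """FP8 tensor cores are available from compute capability 8.9 upward."""
    return int(COMPUTE_CAPABILITY[gpu]) >= 89


print(arch_list(["H100", "A100"]))
print(supports_fp8("A100"), supports_fp8("H100"))
```

Building only for the architectures you actually deploy on keeps compile time and binary size down, and the FP8 check matters later when choosing the engine precision.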

First, clone the TensorRT-LLM repository and create a custom Dockerfile to reduce the image size and drop unneeded components.

Create the file /root/tensorrt-llm-build/Dockerfile with the following complete content:

FROM nvidia/cuda:12.3.1-cudnn8-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV TRT_LLM_VERSION=0.10.0
# 80 = Ampere (A100), 86 = Ampere (e.g. A10), 89 = Ada Lovelace; add 90 to target Hopper (H100)
ENV CUDA_ARCHITECTURES="80;86;89"

RUN apt-get update && apt-get install -y \
    git \
    build-essential \
    cmake \
    python3-pip \
    python3-dev \
    wget \
    && rm -rf /var/lib/apt/lists/*

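# Install PyTorch and TensorRT.
# The TensorRT version must match what the TensorRT-LLM tag checked out below requires.
RUN pip3 install torch --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install tensorrt==8.6.1

# Clone and build TensorRT-LLM.
# Upstream documents scripts/build_wheel.py as the build entry point; adjust this
# step if setup.py does not produce a wheel for the tag you check out.
WORKDIR /workspace
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && \
    cd TensorRT-LLM && \
    git checkout v$TRT_LLM_VERSION && \
    python3 setup.py bdist_wheel

# Install the wheel that was just built (it lives inside the cloned repository)
RUN pip3 install /workspace/TensorRT-LLM/dist/*.whl

# Install supporting inference libraries
RUN pip3 install vllm==0.5.0 transformers accelerate

# Runtime settings
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
ENV NVIDIA_VISIBLE_DEVICES=all

WORKDIR /app
# Server entry point as used throughout this series; verify the module exists in your build.
CMD ["python3", "-m", "tensorrt_llm.server"]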

Build the Docker image with an explicit tag for version management:

docker build -t trt-llm-custom:0.10.0 -f /root/tensorrt-llm-build/Dockerfile /root/tensorrt-llm-build

Expected result: the image builds successfully and is ready to run on a GPU. Because this is a single-stage build on a `devel` base image, the compiler toolchain stays in the final image; a multi-stage build would be needed to make it smaller than the public image.

Verify: run docker images | grep trt-llm-custom to confirm the image exists.

Converting a Hugging Face model to a TensorRT engine

TensorRT-LLM requires the model to be compiled into an engine (a serialized `.engine` file) before inference. This step optimizes the computation graph and memory access patterns for the specific GPU.
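
To get a rough sense of how large the resulting engine will be, note that the model weights dominate its size. The sketch below estimates weight memory at different precisions, assuming a round figure of 7 billion parameters for Llama-2-7B; the function name `weight_gib` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_gib(num_params, precision):
    """Approximate weight memory in GiB for a given parameter count."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# 7e9 is a round-number assumption for a "7B" model.
for precision in ("fp16", "fp8"):
    print(f"{precision}: {weight_gib(7e9, precision):.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why an FP8 engine for the same model is expected to be about half the size of an FP16 one.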

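We will use the conversion tooling that ships with TensorRT-LLM to convert Llama-2-7B (or an equivalent model you are using) to the TensorRT format. Tool and flag names differ between TensorRT-LLM releases, so check them against the `--help` output of the version you built.

Create the engine build script /root/scripts/build_engine.sh: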

#!/bin/bash
set -euo pipefail

# Hugging Face model identifier
MODEL_NAME="meta-llama/Llama-2-7b-hf"
# Directory where the engine is written
OUTPUT_DIR="/mnt/trt-engines/llama2-7b"
# Number of GPUs (one A100/H100 card assumed for this demo)
GPUS=1
# Precision: fp8 for the highest throughput (needs compute capability 8.9+,
# i.e. Ada or Hopper; not available on A100), fp16 for higher accuracy
PRECISION="fp8"

mkdir -p "$OUTPUT_DIR"

docker run --rm --gpus all \
    -v /root/models:/root/models:ro \
    -v /mnt/trt-engines:/mnt/trt-engines \
    -e CUDA_VISIBLE_DEVICES=0 \
    trt-llm-custom:0.10.0 \
    python3 -m tensorrt_llm.tools.convert_checkpoint \
    --model_name "$MODEL_NAME" \
    --output_dir "$OUTPUT_DIR" \
    --dtype "$PRECISION" \
    --use_weight_only_quantization=false \
    --max_num_tokens 4096 \
    --max_num_seqs 16 \
    --max_beam_width 1 \
    --max_input_len 2048 \
    --max_output_len 2048

Run the script after downloading the model to `/root/models` (or edit the paths in the script):

chmod +x /root/scripts/build_engine.sh && /root/scripts/build_engine.sh

Expected result: the script runs the build, prints compilation logs, and writes an engine file (TensorRT-LLM typically names it `rank0.engine` on a single GPU) together with metadata files to `/mnt/trt-engines/llama2-7b`.

Verify: inspect the output directory with ls -lh /mnt/trt-engines/llama2-7b. You should see a large `.engine` file (several GB).
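
The same check can be scripted, for example as a pre-flight step before deploying. The sketch below lists `*.engine` files with their sizes and reports whether a `config.json` sits next to them. The helper name `summarize_engine_dir` and the file names `rank0.engine` and `config.json` are assumptions about the output layout; the demo runs on a throwaway directory so that it works without a real engine.

```python
import tempfile
from pathlib import Path


def summarize_engine_dir(directory):
    """Return ([(engine file name, size in bytes), ...], config.json present?)."""
    root = Path(directory)
    engines = [(p.name, p.stat().st_size) for p in sorted(root.glob("*.engine"))]
    has_config = (root / "config.json").is_file()
    return engines, has_config


# Demo on a temporary directory with placeholder files (not a real engine).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "rank0.engine").write_bytes(b"\0" * 1024)
    (Path(tmp) / "config.json").write_text("{}")
    demo_engines, demo_has_config = summarize_engine_dir(tmp)

print(demo_engines, demo_has_config)
```

Against the real output, call `summarize_engine_dir("/mnt/trt-engines/llama2-7b")` and treat an empty engine list or a file of only a few kilobytes as a failed build.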

Configuring NVIDIA libraries and optimizing memory usage

For the engine to run stably and make full use of GPU memory, configure the NVIDIA libraries it relies on (cuDNN, NCCL) and tune the memory pool parameters in the inference configuration file.

Create the YAML configuration file for the TensorRT-LLM server, /root/configs/trt-llm-config.yaml:

model:
  name: llama2-7b
  dtype: fp8
  max_num_tokens: 4096
  max_num_seqs: 16
  max_beam_width: 1
  max_input_len: 2048
  max_output_len: 2048

engine:
  build:
    precision: fp8
    use_weight_only_quantization: false
    remove_input_padding: true
    remove_output_padding: true
  runtime:
    memory_pool_size: 0.8 # Reserve 80% of VRAM for the memory pool to avoid OOM
    enable_custom_all_reduce: true # Multi-GPU optimization, if needed
    tensor_parallel_size: 1 # Number of GPUs per model instance
    pipeline_parallel_size: 1

server:
  host: "0.0.0.0"
  port: 8000
  log_level: "INFO"
  enable_metrics: true
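
To see what these limits imply for memory, it helps to estimate the KV cache footprint. The sketch below is a back-of-the-envelope calculation that assumes an FP16 KV cache, the public Llama-2-7B dimensions (32 layers, 32 key/value heads of size 128), and an 80 GB card; the function name `kv_bytes_per_token` is hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """KV cache bytes stored per token; factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Llama-2-7B dimensions with a 2-byte (FP16) cache entry.
per_token = kv_bytes_per_token(32, 32, 128, bytes_per_value=2)

# Worst case from the config: max_num_seqs * (max_input_len + max_output_len).
max_tokens_in_flight = 16 * (2048 + 2048)
worst_case_gib = per_token * max_tokens_in_flight / 2**30

# memory_pool_size * VRAM, assuming an 80 GB GPU.
pool_gb = 0.8 * 80

print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"Worst-case KV cache: {worst_case_gib:.0f} GiB, pool: {pool_gb:.0f} GB")
```

Under these assumptions the worst case fits inside the pool with room left for the weights; on a smaller card, `max_num_seqs` or the sequence lengths would have to come down.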

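Set the NCCL environment variables in the container. They pin NCCL to the `eth0` interface, disable the InfiniBand transport (for nodes without InfiniBand), and enable informational logging; they matter mainly for multi-GPU runs:

export NCCL_SOCKET_IFNAME=eth0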
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=INFO

Expected result: the server starts with a safe memory limit, avoiding crashes from OOM (out of memory), and NCCL is configured for parallel execution.

Verify: run nvidia-smi inside the running container to check VRAM usage. It should not hit 100% right after startup; it should stay at about 80%, as configured.
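
For an automated check, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` prints one `used, total` line per GPU in MiB. The sketch below parses such a line into a usage fraction; the function name `used_fraction` and the sample line are illustrative assumptions.

```python
def used_fraction(csv_line):
    """Parse one 'used, total' line (values in MiB) and return used / total."""
    used, total = (float(value) for value in csv_line.split(","))
    return used / total


# Hypothetical output line for a single 80 GB GPU.
sample = "65536, 81920"
print(f"VRAM in use: {used_fraction(sample):.0%}")
```

A value close to the configured `memory_pool_size` suggests the pool was allocated as intended; a value near 1.0 right after startup is a warning sign.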

Deploying the TensorRT-LLM server on Kubernetes

Replace the earlier vLLM deployment with TensorRT-LLM. We use standard Kubernetes manifests (YAML) to define a Deployment and a Service, and mount the volume that holds the prebuilt engine.

Create the Kubernetes manifest /root/k8s-manifests/trt-llm-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: trt-llm-inference
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trt-llm
  template:
    metadata:
      labels:
        app: trt-llm
    spec:
      containers:
      - name: trt-llm-server
        image: trt-llm-custom:0.10.0
        imagePullPolicy: Never
        command: ["python3"]
        args:
          - "-m"
          - "tensorrt_llm.server"
          - "--config"
          - "/app/config/trt-llm-config.yaml"
        ports:
        - containerPort: 8000
          name: http
        volumeMounts:
        - name: engine-storage
          mountPath: /mnt/trt-engines
        - name: config-storage
          mountPath: /app/config
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        env:
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        - name: NCCL_IB_DISABLE
          value: "1"
      volumes:
      - name: engine-storage
        hostPath:
          path: /mnt/trt-engines
      - name: config-storage
        configMap:
          name: trt-llm-config
---
apiVersion: v1
kind: Service
metadata:
  name: trt-llm-service
  namespace: ai-inference
spec:
  selector:
    app: trt-llm
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP

First, create a ConfigMap from the YAML configuration file created in the previous section:

kubectl create configmap trt-llm-config --namespace ai-inference --from-file=trt-llm-config.yaml=/root/configs/trt-llm-config.yaml

Apply the Deployment and Service:

kubectl apply -f /root/k8s-manifests/trt-llm-deployment.yaml

Check the Pod status to make sure it has reached Running and has been allocated a GPU:

kubectl get pods -n ai-inference -l app=trt-llm

Check the logs to confirm the server loaded the engine successfully:

kubectl logs -f -n ai-inference $(kubectl get pods -n ai-inference -l app=trt-llm -o jsonpath='{.items[0].metadata.name}') | grep "Engine loaded"

Expected result: the Pod is in the Running state, the log shows "Engine loaded successfully", and the Service exposes port 8000.
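
These checks can also be done programmatically on the JSON that `kubectl get pods -n ai-inference -l app=trt-llm -o json` returns (each entry of its `items` list is one Pod object). The sketch below summarizes phase, container readiness, and GPU limit for one Pod; the function name `pod_summary` and the trimmed sample object are illustrative assumptions.

```python
import json


def pod_summary(pod):
    """Summarize phase, container readiness, and GPU limit for one Pod object."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    containers = pod.get("spec", {}).get("containers", [])
    gpu_limit = sum(
        int(c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu", 0))
        for c in containers
    )
    return {
        "phase": pod.get("status", {}).get("phase"),
        "ready": bool(statuses) and all(s.get("ready", False) for s in statuses),
        "gpu_limit": gpu_limit,
    }


# Hypothetical, trimmed-down Pod object in the shape kubectl emits.
sample = json.loads("""
{
  "spec": {"containers": [{"name": "trt-llm-server",
                           "resources": {"limits": {"nvidia.com/gpu": "1"}}}]},
  "status": {"phase": "Running",
             "containerStatuses": [{"name": "trt-llm-server", "ready": true}]}
}
""")
print(pod_summary(sample))
```

A Pod that is Running but not ready usually means the server is still loading the engine, so checking both fields avoids sending benchmark traffic too early.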

Performance benchmark: vLLM vs TensorRT-LLM

To demonstrate the effect of the optimization, we run a simple benchmark that measures tokens per second (TPS) and latency for the same prompt on both systems.

We use a simple Python script to send requests and measure elapsed time. It assumes vLLM and TensorRT-LLM are each reachable through their own Kubernetes Service on port 8000; adjust the URLs if your vLLM Service listens on a different port, such as 8001.

Create the benchmark script /root/scripts/benchmark.py:

import json
import time

import requests


def benchmark_inference(url, prompt, max_tokens=512):
    """Send one generation request; return latency and an estimated throughput."""
    payload = {
        "model": "llama2-7b",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.95,
        "stream": False,
    }

    start_time = time.perf_counter()
    try:
        response = requests.post(f"{url}/generate", json=payload, timeout=60)
        duration = time.perf_counter() - start_time
        if response.status_code != 200:
            return {"status": "error", "message": response.text}
        data = response.json()
        # Assumes the API returns the generated text in a "text" field,
        # either as a string or as a list of strings.
        generated_text = data.get("text", "")
        if isinstance(generated_text, list):
            generated_text = generated_text[0] if generated_text else ""
        # Rough estimate: whitespace-separated words, not tokenizer tokens.
        num_tokens = len(generated_text.split())
        return {
            "status": "success",
            "latency_ms": duration * 1000,
            "tokens_per_sec": num_tokens / duration,
            "num_tokens": num_tokens,
        }
    except Exception as e:
        return {"status": "error", "message": str(e)}


# Long sample prompt to measure realistic performance
PROMPT = """Write a detailed essay on the history of artificial intelligence, from early logic-based models to modern deep neural networks. Analyze the key breakthroughs and their impact on society. The essay should have a clear structure, coherent arguments, and concrete evidence."""

print("Starting benchmark...")
print(f"Prompt length: {len(PROMPT)} characters")

# Test vLLM (adjust the URL if your vLLM Service uses a different port)
print("\n--- Testing vLLM ---")
result_vllm = benchmark_inference("http://vllm-service.ai-inference:8000", PROMPT)
print(json.dumps(result_vllm, indent=2))

# Test TensorRT-LLM (Service port 8000)
print("\n--- Testing TensorRT-LLM ---")
result_trt = benchmark_inference("http://trt-llm-service.ai-inference:8000", PROMPT)
print(json.dumps(result_trt, indent=2))

# Compare, guarding against a zero baseline
both_ok = result_vllm["status"] == "success" and result_trt["status"] == "success"
if both_ok and result_vllm["tokens_per_sec"] > 0:
    speedup = result_trt["tokens_per_sec"] / result_vllm["tokens_per_sec"]
    latency_drop = (result_vllm["latency_ms"] - result_trt["latency_ms"]) / result_vllm["latency_ms"] * 100
    print("\n=== COMPARISON ===")
    print(f"Speedup (TensorRT-LLM vs vLLM): {speedup:.2f}x")
    print(f"Latency reduction: {latency_drop:.2f}%")

Run the benchmark script from another container or from a client node that can reach the Kubernetes Services:

python3 /root/scripts/benchmark.py

Expected result: TensorRT-LLM typically delivers higher tokens per second (often 1.5x to 3x, depending on prompt length and GPU architecture) and lower latency than default vLLM, most noticeably for long requests or high throughput.

Verify: look at `tokens_per_sec` and `latency_ms` in the output. TensorRT-LLM should show a higher `tokens_per_sec` than vLLM.
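
A single request per system is noisy, so it can help to repeat the measurement and compare summary statistics. The sketch below computes the median and a nearest-rank 95th percentile over a list of latencies and a simple throughput ratio; the function names `summarize` and `speedup` and the sample numbers are illustrative, not measured results.

```python
import math
import statistics


def summarize(latencies_ms):
    """Return the median and nearest-rank 95th percentile of the latencies."""
    ordered = sorted(latencies_ms)
    p95_rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank position
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_rank - 1],
    }


def speedup(baseline_tps, candidate_tps):
    """Throughput ratio of the candidate over the baseline."""
    return candidate_tps / baseline_tps


# Made-up latencies (ms) with one slow outlier, for illustration only.
runs = [120.0, 100.0, 110.0, 400.0, 105.0]
print(summarize(runs))
print(f"{speedup(40.0, 60.0):.2f}x")
```

The median is robust to the occasional slow request, while the 95th percentile exposes tail latency that an average would hide.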
