Part 7: Automating the CI/CD Pipeline for DataOps

Author: thinh04, 21/03/2026

Configuring GitHub Actions for the DataOps CI/CD pipeline

We need to set up an automated pipeline on GitHub Actions that triggers on every commit to the main branch (and on pull requests targeting main). The pipeline performs these steps: check out the code, set up the environment, run the data tests, train the model, and register it in MLflow if it reaches the threshold.

Create the workflow configuration file at .github/workflows/dataops-ci-cd.yml. This file defines the triggers, the jobs, and the concrete steps each job runs.

name: DataOps CI/CD Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  DVC_REMOTE: ${{ secrets.DVC_REMOTE_S3_PATH }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
        with:
          submodules: recursive

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install "dvc[s3]"

      - name: Configure DVC
        run: |
          # `dvc remote modify` takes the remote name as a positional argument;
          # --local keeps credentials in .dvc/config.local, out of version control.
          dvc remote modify --local remote url "$DVC_REMOTE"
          dvc remote modify --local remote access_key_id "$AWS_ACCESS_KEY_ID"
          dvc remote modify --local remote secret_access_key "$AWS_SECRET_ACCESS_KEY"

      - name: Pull data
        run: dvc pull

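      - name: Run Data Tests
        run: |
          python tests/test_data_quality.py
        env:
          # The test script passes DATA_PATH to pd.read_csv, so it must name a file, not a directory.
          DATA_PATH: data/processed/train.csv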

      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: test-reports/

  model-training:
    needs: data-validation
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

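      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install "dvc[s3]" mlflow

      - name: Configure DVC
        # Each job starts on a fresh runner, so the remote must be configured again.
        run: |
          dvc remote modify --local remote url "$DVC_REMOTE"
          dvc remote modify --local remote access_key_id "$AWS_ACCESS_KEY_ID"
          dvc remote modify --local remote secret_access_key "$AWS_SECRET_ACCESS_KEY"

      - name: Pull latest data
        run: dvc pull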

      - name: Train Model and Log to MLflow
        id: train
        run: |
          # Assumption: train.py appends "run-id=<MLflow run ID>" to $GITHUB_OUTPUT,
          # which is what makes steps.train.outputs.run-id available below.
          python train.py --experiment-name "CI-Training" --threshold 0.85
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

      - name: Register Model if threshold met
        run: |
          python register_model.py --run-id "${{ steps.train.outputs.run-id }}" --threshold 0.85
        if: success()

  deploy-to-k8s:
    needs: model-training
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
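      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install MLflow
        # deploy_model.py imports mlflow, which a fresh runner does not provide.
        run: pip install mlflow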

      - name: Set up Kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.25.0'

      - name: Configure Kubernetes Context
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
          mkdir -p ~/.kube
          mv kubeconfig ~/.kube/config

      - name: Deploy Model to Kubernetes
        run: |
          python deploy_model.py --model-name "my-ai-model" --k8s-namespace "production"
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          K8S_TOKEN: ${{ secrets.K8S_TOKEN }}

      - name: Verify Deployment
        run: kubectl rollout status deployment/my-ai-model-deployment -n production

Expected result: when you push code to GitHub, Actions runs the pipeline automatically. If the data tests fail, the pipeline stops at the data-validation job. If they pass, it trains the model, logs it to MLflow and, if the model meets the threshold, deploys it to the Kubernetes cluster automatically. The deploy job runs only for pushes to main, not for pull requests.
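The register step reads steps.train.outputs.run-id, but train.py itself is not shown in this part. As a minimal sketch of how a training script could expose that value, assuming the standard GitHub Actions mechanism of appending name=value lines to the file named by the GITHUB_OUTPUT environment variable, the helper below writes the run ID there. The function name write_step_output is hypothetical.

```python
import os


def write_step_output(name: str, value: str) -> bool:
    """Append `name=value` to the GitHub Actions step-output file.

    Returns False when GITHUB_OUTPUT is not set (for example on a local
    run), so the training script still works outside CI.
    """
    output_file = os.environ.get("GITHUB_OUTPUT")
    if not output_file:
        return False
    with open(output_file, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")
    return True
```

Inside train.py this would be called once training finishes, for example write_step_output("run-id", run.info.run_id) where run is the active MLflow run. For the expression steps.train.outputs.run-id to resolve, the training step in the workflow also needs id: train.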

To verify, open the "Actions" tab of the GitHub repository, select the workflow run, and inspect each job. Check that the deploy-to-k8s job shows "Success" and that the last log line of the Verify Deployment step reports successfully rolled out.

Automating data quality tests

Before training, we need to make sure the data freshly pulled from DVC is not corrupted and has not changed structure unexpectedly. We will write a simple test script using pandas (great-expectations is an alternative).

Create the test script at tests/test_data_quality.py. It reads the data and checks the row count, null values, and required columns.

import pandas as pd
import sys
import os

def run_data_tests():
    data_path = os.getenv('DATA_PATH', 'data/processed/train.csv')
    
    if not os.path.exists(data_path):
        print(f"Error: Data file not found at {data_path}")
        sys.exit(1)

    try:
        df = pd.read_csv(data_path)
    except Exception as e:
        print(f"Error: Failed to read CSV: {e}")
        sys.exit(1)

    # Test 1: Check row count
    assert len(df) > 0, "Dataset is empty"
    print(f"Check 1 passed: Row count = {len(df)}")

    # Test 2: Check for critical nulls
    critical_columns = ['target', 'feature_1', 'feature_2']
    for col in critical_columns:
        if col not in df.columns:
            print(f"Error: Missing column {col}")
            sys.exit(1)
        
        null_count = df[col].isnull().sum()
        assert null_count == 0, f"Column {col} has {null_count} null values"
        print(f"Check 2 passed: No nulls in {col}")

    # Test 3: Check data types (accepts any numeric dtype, not only int64/float64)
    assert pd.api.types.is_numeric_dtype(df['target']), "Target column must be numeric"
    print("Check 3 passed: Data types are correct")

    print("All data quality tests passed.")
    return True

if __name__ == "__main__":
    run_data_tests()

Expected result: the script prints All data quality tests passed if the data is valid. If the data is invalid, it reports the specific failure and exits with a non-zero exit code, which makes the data-validation job in GitHub Actions fail.
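The mechanism relied on here is that GitHub Actions marks a run step as failed when its command exits with a non-zero status, and that an uncaught AssertionError makes the Python interpreter exit with status 1. A small sketch that demonstrates this locally; the helper name exit_code_of is an illustrative assumption:

```python
import subprocess
import sys


def exit_code_of(snippet: str) -> int:
    """Run a Python snippet in a child interpreter and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
    )
    return completed.returncode


# A passing check exits with 0; a failed assert or sys.exit(1) exits with 1.
passing = exit_code_of("assert 1 + 1 == 2")
failing_assert = exit_code_of("assert len([]) > 0, 'Dataset is empty'")
explicit_exit = exit_code_of("import sys; sys.exit(1)")
print(passing, failing_assert, explicit_exit)  # 0 1 1
```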

To verify, run python tests/test_data_quality.py locally, or inspect the log of the Run Data Tests step in GitHub Actions. Check the output for errors about null values or missing columns.
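The test script checks only that the required columns exist. To also catch the unexpected structure changes mentioned at the start of this section, the CSV header can be compared against an expected column list. A minimal sketch using only the standard library; EXPECTED_COLUMNS and the function name diff_columns are assumptions for illustration, not part of the original script:

```python
import csv
import io

# Hypothetical expected schema; adjust to the real dataset.
EXPECTED_COLUMNS = ["feature_1", "feature_2", "target"]


def diff_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names based on the CSV header row."""
    header = next(csv.reader(csv_file), [])
    missing = [col for col in expected if col not in header]
    unexpected = [col for col in header if col not in expected]
    return missing, unexpected


# In-memory sample: feature_2 was dropped and a new column appeared.
sample = io.StringIO("feature_1,feature_3,target\n1.0,2.0,0\n")
missing, unexpected = diff_columns(sample)
print("missing:", missing)        # missing: ['feature_2']
print("unexpected:", unexpected)  # unexpected: ['feature_3']
```

In the pipeline, the same function could be called with open(data_path, newline="") and the run failed when either list is non-empty.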

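Automatic model registration logic for MLflow

We do not register every trained model in MLflow, only those whose performance clears the required threshold, so that a newly registered version is a credible replacement for the model currently deployed. This keeps the registry free of clutter and ensures that only strong candidates move toward production. Note that the script below checks a fixed accuracy threshold; it does not look up the deployed model's metric.

Create the registration script at register_model.py. It receives the run-id from the training step, compares the run's accuracy metric with the threshold, and calls the mlflow.register_model API if the requirement is met.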

import mlflow
import argparse
import os

def register_model(run_id, threshold, model_name="my-ai-model"):
    mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI'))
    
    run = mlflow.get_run(run_id)
    
    # Read the accuracy metric from the run
    if 'accuracy' not in run.data.metrics:
        print("Error: 'accuracy' metric not found in run.")
        return False

    current_accuracy = run.data.metrics['accuracy']
    print(f"Current model accuracy: {current_accuracy}")
    print(f"Required threshold: {threshold}")

    if current_accuracy >= threshold:
        try:
            # Register the model in the Model Registry
            model_uri = f"runs:/{run_id}/model"
            registered_model = mlflow.register_model(model_uri, model_name)
            print(f"Model registered successfully: {registered_model.name}")
            return True
        except Exception as e:
            print(f"Error registering model: {e}")
            return False
    else:
        print("Model accuracy below threshold. Skipping registration.")
        return False

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Register MLflow model if threshold met")
    parser.add_argument("--run-id", type=str, required=True, help="MLflow Run ID")
    parser.add_argument("--threshold", type=float, default=0.85, help="Accuracy threshold")
    args = parser.parse_args()

    success = register_model(args.run_id, args.threshold)
    if not success:
        # A non-zero exit status fails the CI step, which blocks the deploy job.
        raise SystemExit(1)

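Expected result: if accuracy >= threshold, the console prints Model registered successfully and a new version of the model appears in the MLflow Model Registry. mlflow.register_model does not assign a stage, so the new version starts with stage None and must be promoted to Production (or Staging) before the deploy script, which looks up Production versions, can find it. If the threshold is not met, the script prints Skipping registration and exits with code 1.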

To verify, open the web UI of the MLflow Tracking Server. Go to the "Models" tab and find the model name you defined. Check that a new version was created right after the script ran.
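register_model.py compares the run against a fixed threshold. If the goal is also to beat the model currently in production, the decision can be isolated in a pure function that is easy to unit test. A sketch under the assumption that the production model's accuracy has already been fetched (for example from its MLflow run); should_register and its parameters are hypothetical names:

```python
from typing import Optional


def should_register(candidate_accuracy: float,
                    threshold: float,
                    production_accuracy: Optional[float] = None) -> bool:
    """Decide whether a candidate model should be registered.

    The candidate must reach the fixed threshold and, when a production
    accuracy is known, must strictly exceed it.
    """
    if candidate_accuracy < threshold:
        return False
    if production_accuracy is not None and candidate_accuracy <= production_accuracy:
        return False
    return True


print(should_register(0.90, 0.85))        # True: no production model yet
print(should_register(0.90, 0.85, 0.92))  # False: worse than production
print(should_register(0.80, 0.85, 0.70))  # False: below the threshold
```

Keeping the rule separate from the MLflow calls means it can be tested without a tracking server.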

Deploying the model to Kubernetes automatically

Once the model is registered in MLflow, the final step is to update the deployment on Kubernetes. This can be done from a Python script that calls the Kubernetes API through a client library, or by using kubectl to update the image tag of the container that serves the model. The script below takes the kubectl route.

Create the deploy script at deploy_model.py. It looks up the latest Production version of the model in MLflow, derives the image tag from it (assuming an image with that tag has already been built and pushed), and updates the Deployment resource on K8s.

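import mlflow
import subprocess
import os
from mlflow.tracking import MlflowClient

def deploy_to_k8s(model_name, namespace):
    mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI'))

    # Fetch the latest model version in the Production stage.
    # get_latest_versions is a method of MlflowClient, not of the top-level mlflow module.
    model_version = MlflowClient().get_latest_versions(model_name, stages=['Production'])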
    
    if not model_version:
        print("No model version in Production stage found.")
        return False
    
    version = model_version[0].version
    print(f"Deploying model version: {version}")

    # Assumes the image tag follows the pattern model-name:version
    image_tag = f"my-registry.io/{model_name}:{version}"
    print(f"Target image: {image_tag}")

    # Update the deployment on Kubernetes
    # Use kubectl set image to swap the container's image
    cmd = [
        "kubectl", "set", "image",
        "-n", namespace,
        "deployment/my-ai-model-deployment",
        f"model-container={image_tag}"
    ]
    
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print("Kubernetes deployment updated successfully.")
        print(result.stdout)
        return True
    except subprocess.CalledProcessError as e:
        print(f"Failed to update Kubernetes deployment: {e.stderr}")
        return False

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", type=str, default="my-ai-model")
    parser.add_argument("--k8s-namespace", type=str, default="production")
    args = parser.parse_args()

    success = deploy_to_k8s(args.model_name, args.k8s_namespace)
    if not success:
        # A non-zero exit status marks the deploy step as failed in CI.
        raise SystemExit(1)

Expected result: Kubernetes performs a rolling update, gradually terminating the old pods as new pods start with the new image tag. The subsequent kubectl rollout status command then reports deployment "my-ai-model-deployment" successfully rolled out.

To verify, run kubectl get pods -n production -l app=my-ai-model and check that the new pods are in the Running state. Then call the model's API endpoint (typically /predict) and check that the response reflects the behavior of the new model version.
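The pod check can also be scripted. The sketch below parses the JSON that kubectl get pods -n production -l app=my-ai-model -o json prints and confirms that every pod is Running with the expected image. The function name pods_ready and the sample payload are illustrative assumptions; only the items, status.phase and spec.containers[].image fields of the Pod list are used.

```python
import json


def pods_ready(pod_list_json: str, expected_image: str) -> bool:
    """Return True when at least one pod exists and every pod is in the
    Running phase with a container that uses expected_image."""
    pods = json.loads(pod_list_json).get("items", [])
    if not pods:
        return False
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Running":
            return False
        images = [c.get("image") for c in pod.get("spec", {}).get("containers", [])]
        if expected_image not in images:
            return False
    return True


# Hand-written sample shaped like `kubectl get pods -o json` output.
sample = json.dumps({
    "items": [
        {"spec": {"containers": [{"name": "model-container",
                                  "image": "my-registry.io/my-ai-model:3"}]},
         "status": {"phase": "Running"}},
        {"spec": {"containers": [{"name": "model-container",
                                  "image": "my-registry.io/my-ai-model:3"}]},
         "status": {"phase": "Pending"}},
    ]
})
print(pods_ready(sample, "my-registry.io/my-ai-model:3"))  # False: one pod is still Pending
```

Note that the Running phase alone does not prove that readiness probes pass, so the /predict call remains the more meaningful check.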

Series navigation:

Table of contents: Series: Building a DataOps platform with DVC, MLflow and Kubernetes for the AI lifecycle

« Part 6: Deploying the AI model to Kubernetes