Part 7: Automating the CI/CD Workflow for DataOps

Author: thinh04, 21/03/2026

Configuring a CI/CD pipeline with GitHub Actions for DataOps

Preparing the directory structure and configuration file

We will create a workflow file that runs automatically on every push to the main or develop branch that touches src/, data/, tests/, dvc.lock, or dvc.yaml, and on every pull request that targets main.

Why: Every change to code or data metadata gets tested before it is merged.

Expected result: GitHub Actions lists the workflow in the repository, ready to be triggered.

Create the file .github/workflows/ci-cd.yml with the following complete content:

name: DataOps CI/CD Pipeline

on:
  push:
    branches: [ main, develop ]
    paths:
      - 'src/**'
      - 'data/**'
      - 'tests/**'
      - 'dvc.lock'
      - 'dvc.yaml'
  pull_request:
    branches: [ main ]

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  DVC_REMOTE: ${{ secrets.DVC_REMOTE }}
  KUBE_CONFIG: ${{ secrets.KUBE_CONFIG }}

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
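      - name: Install Dependencies
        run: |
          pip install -r requirements.txt
          pip install "dvc[s3]" pytest
      - name: Configure DVC
        env:
          DVC_CREDENTIAL: ${{ secrets.DVC_CREDENTIAL }}
        run: |
          # The repository is already a DVC project (dvc.yaml and dvc.lock are tracked),
          # so `dvc init` is not needed here.
          dvc remote add -d -f storage "$DVC_REMOTE"
          # Assumption: DVC_CREDENTIAL holds the contents of an AWS credentials file
          # (a [default] header followed by the access key and secret key lines).
          mkdir -p ~/.aws
          printf '%s\n' "$DVC_CREDENTIAL" > ~/.aws/credentials
          dvc remote modify --local storage credentialpath ~/.aws/credentials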
      - name: Fetch Data
        run: dvc pull
      - name: Run Data Tests
        run: pytest tests/test_data_quality.py -v
        continue-on-error: false

  train-model:
    needs: validate-data
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
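      - name: Install Dependencies
        run: |
          pip install -r requirements.txt
          pip install "dvc[s3]" mlflow
      - name: Configure DVC
        env:
          DVC_CREDENTIAL: ${{ secrets.DVC_CREDENTIAL }}
        run: |
          # Each job starts on a fresh runner, so the remote is configured again.
          # Assumption: DVC_CREDENTIAL holds the contents of an AWS credentials file.
          dvc remote add -d -f storage "$DVC_REMOTE"
          mkdir -p ~/.aws
          printf '%s\n' "$DVC_CREDENTIAL" > ~/.aws/credentials
          dvc remote modify --local storage credentialpath ~/.aws/credentials
      - name: Fetch Data
        run: dvc pull
      - name: Train Model
        id: train
        run: |
          set -o pipefail
          python src/train.py --mlflow-run-name "CI-${{ github.sha }}" | tee train.log
          # Expose the run ID printed by src/train.py to later steps
          # as steps.train.outputs.run-id.
          echo "run-id=$(grep -oP 'Run ID: \K\S+' train.log)" >> "$GITHUB_OUTPUT"
      # src/train.py logs the model itself with mlflow.sklearn.log_model,
      # so no separate CLI step is needed for that.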

  deploy-to-k8s:
    needs: train-model
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      - name: Set up Kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.24.0'
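      - name: Configure Kubeconfig
        run: |
          mkdir -p ~/.kube
          # KUBE_CONFIG is stored base64-encoded, so decode it before writing.
          echo "$KUBE_CONFIG" | base64 -d > ~/.kube/config
          chmod 600 ~/.kube/config
      - name: Deploy Model
        env:
          # Assumption: the image reference is kept in a repository variable
          # named MLFLOW_MODEL_IMAGE.
          MLFLOW_MODEL_IMAGE: ${{ vars.MLFLOW_MODEL_IMAGE }}
        run: |
          kubectl set image deployment/ml-model-service ml-model-service="$MLFLOW_MODEL_IMAGE"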
      - name: Verify Deployment
        run: kubectl rollout status deployment/ml-model-service
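
To make the job ordering in this workflow easier to reason about, here is a minimal Python sketch of the same dependency graph. It is only an illustration: JOB_NEEDS, MAIN_ONLY_JOBS, and jobs_for_ref are hypothetical names, the graph is copied by hand from the `needs` and `if` keys above, and GitHub Actions does its own scheduling.

```python
from graphlib import TopologicalSorter

# Job graph copied by hand from the workflow: job name -> jobs it `needs`.
JOB_NEEDS = {
    "validate-data": [],
    "train-model": ["validate-data"],
    "deploy-to-k8s": ["train-model"],
}

# Jobs guarded by `if: github.ref == 'refs/heads/main'`.
MAIN_ONLY_JOBS = {"deploy-to-k8s"}


def jobs_for_ref(ref: str) -> list[str]:
    """Return the jobs that run for a Git ref, in dependency order."""
    order = list(TopologicalSorter(JOB_NEEDS).static_order())
    return [
        job for job in order
        if job not in MAIN_ONLY_JOBS or ref == "refs/heads/main"
    ]


if __name__ == "__main__":
    print(jobs_for_ref("refs/heads/develop"))
    print(jobs_for_ref("refs/heads/main"))
```

A push to develop therefore stops after training, while a push to main also reaches the deployment job.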

Setting environment variables in GitHub Secrets

Define the sensitive values the pipeline needs in order to run safely.

Why: Credentials must never be stored directly in source code.

Expected result: The pipeline can reach MLflow, the DVC storage, and the Kubernetes cluster.

In the repository, go to Settings -> Secrets and variables -> Actions and create the following secrets:

MLFLOW_TRACKING_URI=https://mlflow-tracking.example.com
DVC_REMOTE=s3://my-data-bucket
DVC_CREDENTIAL=aws_access_key_id=AKIA... aws_secret_access_key=...
KUBE_CONFIG=<base64-encoded kubeconfig file>

Verify: From a local terminal where KUBE_CONFIG holds the same base64 value, run echo "$KUBE_CONFIG" | base64 -d > kubeconfig && kubectl --kubeconfig kubeconfig cluster-info to confirm that the cluster is reachable.
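
A pipeline that starts with a missing secret usually fails late and with an unclear message. The sketch below shows one way to fail early instead, assuming the four secret names listed above are exposed to the job as environment variables and that KUBE_CONFIG is base64-encoded. The function check_secrets is a hypothetical helper, not part of GitHub Actions or any library.

```python
import base64
import binascii

# Secret names used by the workflow in this part.
REQUIRED_SECRETS = ("MLFLOW_TRACKING_URI", "DVC_REMOTE", "DVC_CREDENTIAL", "KUBE_CONFIG")


def check_secrets(env):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"{name} is not set" for name in REQUIRED_SECRETS if not env.get(name)]
    kube_config = env.get("KUBE_CONFIG")
    if kube_config:
        # base64 output may be wrapped across lines, so drop whitespace first.
        compact = "".join(kube_config.split())
        try:
            base64.b64decode(compact, validate=True)
        except (binascii.Error, ValueError):
            problems.append("KUBE_CONFIG is not valid base64")
    return problems
```

In a workflow step this could be called with os.environ, with the step exiting non-zero when the returned list is not empty.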

Running data tests and training automatically

Configuring the data quality tests

Write a Python script that uses a library such as Great Expectations or pytest to check the schema and null values.

Why: This prevents training a model on corrupted or badly formatted data.

Expected result: The pipeline stops (fails) when the data does not meet the standard.

Create the file tests/test_data_quality.py:

import pandas as pd
import pytest


@pytest.fixture(scope="module")
def df():
    # Load the dataset once and share it across all tests in this module
    return pd.read_csv("data/train.csv")


def test_data_not_empty(df):
    assert len(df) > 0, "Dataset is empty"


def test_no_nulls_in_target(df):
    assert df['target'].isnull().sum() == 0, "Target column contains nulls"


def test_schema_columns(df):
    expected_columns = ['feature1', 'feature2', 'target']
    assert list(df.columns) == expected_columns, f"Schema mismatch. Got {list(df.columns)}"
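
The same three checks can also be expressed as one function that returns every problem it finds, which is convenient when you want a full report rather than the first failing assertion. The sketch below is an alternative written with the standard library only (csv instead of pandas), so it is not the test file above. The name validate_csv is hypothetical, and the column names are the ones assumed by the tests.

```python
import csv
import io

EXPECTED_COLUMNS = ["feature1", "feature2", "target"]


def validate_csv(text, expected_columns=EXPECTED_COLUMNS):
    """Return a list of data-quality problems found in CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    problems = []
    # Schema check: the header must match the expected columns exactly.
    if reader.fieldnames != expected_columns:
        problems.append(f"Schema mismatch. Got {reader.fieldnames}")
        return problems
    rows = list(reader)
    if not rows:
        problems.append("Dataset is empty")
    # Null check: the target must be present and non-blank in every row.
    for row_no, row in enumerate(rows, start=1):
        if row["target"] is None or row["target"].strip() == "":
            problems.append(f"Row {row_no}: target is empty")
    return problems
```

An empty list means the data passed all three checks.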

Running model training in the CI environment

Adjust the training script so that it reads the data fetched by DVC and sends its logs to MLflow.

Why: This makes sure the training process works in a CI environment that has no GPU (or only a virtual one).

Expected result: A new run appears in the MLflow UI with metrics and model artifacts.

Edit src/train.py to add the MLflow logic:

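import argparse

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train(run_name=None):
    with mlflow.start_run(run_name=run_name) as run:
        # Read the dataset that `dvc pull` placed in the workspace
        df = pd.read_csv("data/train.csv")
        X = df[['feature1', 'feature2']]
        y = df['target']

        # Hold out 20% of the rows so accuracy is not measured on training data
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

        model = RandomForestClassifier(n_estimators=10, random_state=42)
        model.fit(X_train, y_train)

        # Log parameters and metrics
        mlflow.log_param("n_estimators", 10)
        accuracy = model.score(X_val, y_val)
        mlflow.log_metric("accuracy", accuracy)

        # Log model
        mlflow.sklearn.log_model(model, "model")

        # Print the run ID and model URI so later pipeline steps can pick them up
        print(f"Run ID: {run.info.run_id}")
        print(f"Model URI: runs:/{run.info.run_id}/model")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--mlflow-run-name", default=None)
    args = parser.parse_args()
    train(run_name=args.mlflow_run_name)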

Verify: Open the MLflow UI and check that the new run has an "accuracy" metric.

Automatically registering the model in the MLflow Registry

Threshold-based automatic registration

Use a Python script or a bash command in the pipeline to compare the metric against a threshold and call the MLflow API.

Why: Only good-quality models should enter the registry, which keeps it uncluttered and lowers deployment risk.

Expected result: When the model meets the threshold, it appears in the "Model Registry" with an incremented version.

Create the file scripts/register_model.py:

import mlflow
import sys

def register_model(run_id, model_name, metric_name, threshold):
    client = mlflow.tracking.MlflowClient()
    run = client.get_run(run_id)
    
    if metric_name not in run.data.metrics:
        print(f"Metric {metric_name} not found in run {run_id}")
        sys.exit(1)
        
    metric_value = run.data.metrics[metric_name]
    print(f"Current {metric_name}: {metric_value}, Threshold: {threshold}")
    
    if metric_value >= threshold:
        model_uri = f"runs:/{run_id}/model"
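        # mlflow.register_model creates the registered model on first use
        # and adds a new version on each call.
        version = mlflow.register_model(model_uri, model_name)
        print(f"Model registered successfully: {model_name} (version {version.version})")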
        return True
    else:
        print("Model did not meet threshold. Skipping registration.")
        return False

if __name__ == "__main__":
    # Args: run_id model_name metric_name threshold
    run_id = sys.argv[1]
    model_name = sys.argv[2]
    metric_name = sys.argv[3]
    threshold = float(sys.argv[4])
    
    register_model(run_id, model_name, metric_name, threshold)

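Update .github/workflows/ci-cd.yml by adding this step to the train-model job. It reads steps.train.outputs.run-id, so the Train Model step needs id: train and must write run-id to $GITHUB_OUTPUT: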

- name: Register Model if Threshold Met
  if: steps.train.outputs.run-id != ''
  run: |
    python scripts/register_model.py \
      ${{ steps.train.outputs.run-id }} \
      "my-production-model" \
      "accuracy" \
      "0.85"

Verify: In the MLflow UI, open Model Registry and check that "my-production-model" has a new latest version.
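
The registration gate has three possible outcomes: the metric is missing, the metric is below the threshold, or the model qualifies. Separating that decision from the MLflow calls makes it easy to unit-test without a tracking server. The sketch below is one way to do so. Decision and decide are hypothetical names, and the metrics dictionary stands in for run.data.metrics.

```python
from enum import Enum


class Decision(Enum):
    REGISTER = "register"
    SKIP = "skip"
    MISSING_METRIC = "missing_metric"


def decide(metrics, metric_name, threshold):
    """Decide what to do with a run, given its logged metrics."""
    if metric_name not in metrics:
        return Decision.MISSING_METRIC
    if metrics[metric_name] >= threshold:
        return Decision.REGISTER
    return Decision.SKIP


if __name__ == "__main__":
    print(decide({"accuracy": 0.91}, "accuracy", 0.85))
```

In the script above, MISSING_METRIC corresponds to the sys.exit(1) branch and SKIP to the branch that returns False.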

Automatically deploying the model to Kubernetes

Preparing the Dockerfile and Helm chart

The Dockerfile builds the image that serves the model, and the Helm chart manages the deployment on K8s.

Why: Separating image build logic from deploy logic makes scaling and updating easier.

Expected result: The image is pushed to the registry and the Helm chart is ready to deploy.

Create the Dockerfile in the deployment/ directory:

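FROM python:3.9-slim

WORKDIR /app

# Install dependencies first so this layer is cached across source changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt mlflow

COPY src/predict.py .

# The Kubernetes Deployment sets MLFLOW_TRACKING_URI at runtime;
# the build argument only provides an optional default.
ARG MLFLOW_TRACKING_URI=""
ENV MLFLOW_TRACKING_URI=${MLFLOW_TRACKING_URI}
ENV MODEL_NAME=my-production-model

EXPOSE 8000

CMD ["python", "predict.py"]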

Create the file deployment/templates/deployment.yaml (Helm template):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-service
  labels:
    app: ml-model-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-model-service
  template:
    metadata:
      labels:
        app: ml-model-service
    spec:
      containers:
      - name: ml-model-service
        image: my-registry/ml-model-service:{{ .Values.imageTag }}
        ports:
        - containerPort: 8000
        env:
        - name: MLFLOW_TRACKING_URI
          value: "{{ .Values.mlflowUri }}"
        - name: MODEL_NAME
          value: "{{ .Values.modelName }}"
        - name: MODEL_VERSION
          value: "{{ .Values.modelVersion }}"

Create the file deployment/values.yaml. Helm also expects a Chart.yaml in the deployment/ directory, which is not shown here:

imageTag: latest
mlflowUri: https://mlflow-tracking.example.com
modelName: my-production-model
modelVersion: 1
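
To illustrate how the defaults in values.yaml interact with overrides passed on the command line, here is a small Python sketch that mimics the substitution of {{ .Values.key }} placeholders. It is an illustration only and not Helm's template engine: render, DEFAULT_VALUES, and the regular expression are hypothetical, and real Helm templates support far more than plain substitution.

```python
import re

# Defaults copied by hand from deployment/values.yaml.
DEFAULT_VALUES = {
    "imageTag": "latest",
    "mlflowUri": "https://mlflow-tracking.example.com",
    "modelName": "my-production-model",
    "modelVersion": 1,
}

PLACEHOLDER = re.compile(r"\{\{\s*\.Values\.(\w+)\s*\}\}")


def render(template, overrides=None):
    """Substitute {{ .Values.key }} placeholders; overrides win over defaults."""
    values = {**DEFAULT_VALUES, **(overrides or {})}
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


if __name__ == "__main__":
    line = "image: my-registry/ml-model-service:{{ .Values.imageTag }}"
    print(render(line))
    print(render(line, {"imageTag": "7"}))
```

The second call plays the role of an override such as --set imageTag=7: the default "latest" is replaced while the other values stay as defined in the file.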

Deploying automatically through GitHub Actions

Use kubectl or Helm in the pipeline to update the image tag after the model has been registered.

Why: The model version running on K8s stays in sync with the version approved in the Registry.

Expected result: New pods are created, pods running the old image are terminated, and traffic shifts to the new model.

Update the deploy step in .github/workflows/ci-cd.yml:

- name: Deploy Model to Kubernetes
  run: |
    # Look up the newest version in the MLflow Registry through the Python client.
    # Requires Python in this job; add a setup-python step as in the jobs above.
    pip install mlflow
    LATEST_VERSION=$(python -c "from mlflow.tracking import MlflowClient; print(max(int(v.version) for v in MlflowClient().search_model_versions(\"name='my-production-model'\")))")

    # Build and push the Docker image (assumes the runner is already logged in to my-registry)
    docker build -t my-registry/ml-model-service:$LATEST_VERSION -f deployment/Dockerfile .
    docker push my-registry/ml-model-service:$LATEST_VERSION

    # Update the Kubernetes Deployment
    kubectl set image deployment/ml-model-service ml-model-service=my-registry/ml-model-service:$LATEST_VERSION

    # Alternative when the release is managed by Helm: run this instead of `kubectl set image`
    # helm upgrade --install ml-model-service ./deployment -f deployment/values.yaml --set imageTag=$LATEST_VERSION --set modelVersion=$LATEST_VERSION

Verify: Run kubectl get pods -l app=ml-model-service and kubectl rollout history deployment/ml-model-service to see the pods and the update history.
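
One detail worth checking when picking the latest version: a registry may report version numbers as strings, and comparing strings puts "9" after "10". The sketch below shows the numeric comparison and how the result becomes an image tag. The helpers latest_version and image_reference are hypothetical; the repository name is the one used in the deploy step.

```python
def latest_version(versions):
    """Return the highest version number from values such as "9" or "10"."""
    if not versions:
        raise ValueError("no registered versions found")
    # Convert to int so that 10 sorts after 9.
    return max(int(v) for v in versions)


def image_reference(version, repository="my-registry/ml-model-service"):
    """Build the image reference that the deploy step tags and pushes."""
    return f"{repository}:{version}"


if __name__ == "__main__":
    newest = latest_version(["1", "9", "10"])
    print(image_reference(newest))
```

With plain string comparison, max(["1", "9", "10"]) would return "9" and the pipeline would redeploy an older model.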

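Confirming the complete workflow

Create a small commit on the develop branch to trigger the pipeline. The commit must touch one of the paths the workflow watches (src/, data/, tests/, dvc.lock, or dvc.yaml).

Why: This checks the flow end to end, from the data tests to deployment.

Expected result: The GitHub Actions logs show "Success" for validate-data and train-model. The deploy-to-k8s job is skipped on develop because it only runs on main, so merge into main to exercise the deployment.

Run: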

git add .
git commit -m "feat: update training logic to trigger CI/CD"
git push origin develop

Verify: Open the "Actions" tab on GitHub, click the workflow run that just started, and read the logs of each job to confirm there are no errors.

Series: Building a DataOps Platform with DVC, MLflow, and Kubernetes for the AI Lifecycle
