vllm-omni-cicd - SKILL.md Agent Skill

name: vllm-omni-cicd description: Set up CI/CD pipelines for vLLM-Omni model deployments including Docker builds, automated testing, rolling updates, and deployment validation. Use when creating deployment pipelines, automating model serving updates, setting up Docker workflows, or configuring GitHub Actions for vllm-omni.

vLLM-Omni CI/CD

Overview

This skill covers CI/CD patterns for deploying and updating vLLM-Omni model serving infrastructure. It includes Docker image builds, automated testing, deployment validation, and rollback strategies.

Docker Build

Production Dockerfile

FROM vllm/vllm-omni:$VLLM_OMNI_VERSION

ARG MODEL_NAME
ENV MODEL_NAME=${MODEL_NAME}

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -sf http://localhost:8091/health || exit 1

EXPOSE 8091

CMD ["sh", "-c", "vllm serve ${MODEL_NAME} --omni --port 8091 --host 0.0.0.0"]

Build and push:

docker build --build-arg MODEL_NAME=Tongyi-MAI/Z-Image-Turbo \
  -t my-registry/vllm-omni-z-image:latest .
docker push my-registry/vllm-omni-z-image:latest

Pre-downloading Models

For faster container startup, bake model weights into the image:

FROM vllm/vllm-omni:$VLLM_OMNI_VERSION

RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('Tongyi-MAI/Z-Image-Turbo', local_dir='/models/z-image')"

ENV MODEL_PATH=/models/z-image
CMD ["sh", "-c", "vllm serve ${MODEL_PATH} --omni --port 8091 --host 0.0.0.0"]

GitHub Actions Pipeline

Basic CI

name: vLLM-Omni CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pre-commit
      - run: pre-commit run --all-files

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      - run: pytest tests/ -v --ignore=tests/gpu

Build and Push Docker Image

  docker:
    needs: [lint, test]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}/vllm-omni:${{ github.sha }}

GPU Integration Tests

  gpu-test:
    runs-on: [self-hosted, gpu]
    needs: [lint]
    steps:
      - uses: actions/checkout@v4
      - run: |
          docker run --gpus all --rm \
            -v $(pwd):/workspace \
            vllm/vllm-omni:$VLLM_OMNI_VERSION \
            pytest /workspace/tests/gpu/ -v

Deployment Strategies

Rolling Update (Kubernetes)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-omni
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: vllm-omni
          image: my-registry/vllm-omni:latest
          readinessProbe:
            httpGet:
              path: /health
              port: 8091
            initialDelaySeconds: 120
            periodSeconds: 10
          resources:
            limits:
              nvidia.com/gpu: 1

Blue-Green Deployment

Deploy new version alongside existing ("green" alongside "blue")
Run validation against green deployment
Switch traffic to green
Tear down blue after confirmation

# Deploy green
kubectl apply -f deployment-green.yaml

# Validate green
python scripts/validate_deployment.sh http://green-service:8091

# Switch traffic
kubectl patch service vllm-omni -p '{"spec":{"selector":{"version":"green"}}}'

# Teardown blue (after validation period)
kubectl delete deployment vllm-omni-blue

Deployment Validation

After every deployment, validate:

Health check: /health returns 200
Model loaded: /v1/models returns expected model
Inference works: Send a test prompt, verify response
Latency acceptable: Response time within SLA

Use the validation script:

./scripts/validate_deployment.sh http://localhost:8091

Rollback

Kubernetes

kubectl rollout undo deployment/vllm-omni

Docker Compose

docker compose pull  # pulls previous known-good tag
docker compose up -d

Monitoring in CI/CD

Check GPU memory usage post-deployment
Monitor p50/p99 latency after rollout
Set up alerts for health check failures
Log model version and git SHA for traceability

References

For deployment pipeline templates, see references/pipeline-templates.md