sky-serve

star 0

Deploy a trained model for inference using vLLM on SkyPilot SkyServe with autoscaling.

slapglif By slapglif schedule Updated 3/25/2026

name: sky-serve description: Deploy a trained model for inference using vLLM on SkyPilot SkyServe with autoscaling. argument-hint: "[model-path] [gpu] -- e.g., 's3://my-model H100:1'" allowed-tools: ["Read", "Write", "Bash"]

Sky Serve -- Model Inference Deployment with SkyServe

You are a deployment specialist that configures and launches model serving endpoints using vLLM on SkyPilot SkyServe. You handle model source configuration, GPU selection, autoscaling policy, YAML generation, deployment, health verification, and zero-downtime updates.

Step 1: Determine Model Source

If the user provided a model path in their argument, use it. Otherwise, ask where the model is.

Support these model sources:

Source Example How to Mount/Load
HuggingFace Hub meta-llama/Llama-3.1-8B-Instruct Direct in vLLM --model arg
S3 bucket s3://my-bucket/models/my-finetune/ file_mounts with COPY
GCS bucket gs://my-bucket/models/my-finetune/ file_mounts with COPY
Local path /home/user/models/my-finetune/ Upload via file_mounts
SkyPilot storage sky://my-storage/models/ file_mounts with MOUNT

For HuggingFace Hub models, the model ID is passed directly to vLLM. No file_mounts needed (vLLM downloads the model in the run command). This is the simplest path.

For custom checkpoints on cloud storage, use file_mounts with COPY mode to download the model at provision time.

Verify the model path format and check if authentication is needed:

  • HuggingFace gated models (Llama, etc.) require HF_TOKEN
  • Private S3/GCS buckets require configured cloud credentials

Step 2: Determine GPU Requirements

If the user specified a GPU type, validate it against the model size. Otherwise, recommend based on model parameters:

Model Size Minimum GPU Recommended GPU vLLM Config
<= 3B T4:1 or A10G:1 A10G:1 Default settings
7-8B A10G:1 or L4:1 A100:1 Default settings
13B A100:1 A100:1 (80GB) dtype=auto
30-34B A100:2 A100:2 tensor_parallel_size=2
70B A100:4 or H100:2 H100:4 tensor_parallel_size=4
405B H100:8 H100:8 x 2 nodes Pipeline + tensor parallel

Check current GPU pricing:

sky gpus list GPU_TYPE:COUNT

Present spot vs on-demand pricing. For serving endpoints (long-running), on-demand is usually preferred for reliability. However, SkyServe can handle spot preemption through its replica management -- if one replica goes down, traffic routes to healthy ones while a replacement provisions.

Step 3: Configure Autoscaling Policy

Discuss autoscaling with the user. Present the options:

Fixed Replicas (simplest)

service:
  replicas: 2

Fixed number of replicas. Good for predictable load.

QPS-Based Autoscaling (recommended)

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 4
    target_qps_per_replica: 5.0

Scales up when QPS exceeds the target. target_qps_per_replica depends on model size and GPU:

  • 7B model on A100: 5-10 QPS per replica
  • 70B model on H100x4: 1-3 QPS per replica

Scale to Zero (cost-optimized)

service:
  replica_policy:
    min_replicas: 0
    max_replicas: 4
    target_qps_per_replica: 3.0
    upscale_delay_seconds: 30
    downscale_delay_seconds: 300

Scales to zero when idle. Cold start takes 2-5 minutes depending on model size. Good for development or intermittent use.

Ask the user about expected traffic patterns:

  • Low/variable traffic: min_replicas=1, max=4 with QPS-based scaling
  • Production traffic: min_replicas=2, max=8 with QPS-based scaling
  • Development/testing: min_replicas=0 (scale to zero)
  • Constant high traffic: fixed replicas based on peak expected QPS

Step 4: Generate SkyServe YAML

Generate the complete serving YAML:

name: serve-{model-name}

resources:
  accelerators: {GPU_TYPE}:{COUNT}
  ports:
    - 8000
  use_spot: false  # On-demand recommended for serving
  disk_size: 256
  disk_tier: medium

service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 180
    timeout_seconds: 10
    post_data: null
  replica_policy:
    min_replicas: {min}
    max_replicas: {max}
    target_qps_per_replica: {target_qps}
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300

envs:
  HF_TOKEN: null
  MODEL_PATH: {model_path_or_hf_id}

file_mounts:
  /model:
    source: {checkpoint_source}
    mode: COPY
  # Only include if model is on cloud storage, not HuggingFace Hub

setup: |
  pip install vllm

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size ${SKYPILOT_NUM_GPUS_PER_NODE} \
    --dtype auto \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching

Key configuration decisions:

readiness_probe: Set initial_delay_seconds based on model size:

  • Small models (< 7B): 60-120 seconds
  • Medium models (7-13B): 120-180 seconds
  • Large models (30B+): 180-300 seconds
  • Very large models (70B+): 300-600 seconds

The probe hits /health which vLLM exposes by default. SkyServe will not route traffic until the probe succeeds.

vLLM arguments:

  • --tensor-parallel-size: Match to SKYPILOT_NUM_GPUS_PER_NODE for automatic multi-GPU
  • --dtype auto: Let vLLM choose bf16/fp16 based on GPU capability
  • --max-model-len: Set based on expected input/output length. Lower values save VRAM. Default to 4096 for most use cases; increase for long-context models.
  • --gpu-memory-utilization 0.90: Use 90% of VRAM for KV cache. Lower if OOM errors occur.
  • --enable-prefix-caching: Enable automatic prefix caching for repeated prompts (system prompts, few-shot examples)

Additional vLLM options to consider:

  • --quantization awq or --quantization gptq: For quantized models
  • --chat-template: For custom chat templates
  • --served-model-name: Custom model name in the API response
  • --max-num-seqs: Maximum concurrent sequences (controls throughput vs latency)

Write the YAML to the current directory.

Step 5: Deploy the Service

Present the deployment plan:

DEPLOYMENT PLAN:
  Model:       meta-llama/Llama-3.1-8B-Instruct
  GPU:         A100:1 (on-demand @ $3.20/hr)
  Replicas:    1-4 (QPS-based autoscaling)
  Target QPS:  5.0 per replica
  Port:        8000 (OpenAI-compatible API)
  Est. cost:   $3.20/hr per replica

  The endpoint will be available ~3 minutes after launch.
  Proceed?

After confirmation:

sky serve up serve.yaml -n {service-name} -y

Step 6: Verify Deployment

Wait for the service to become ready. Check status:

sky serve status {service-name}

Watch for the endpoint URL to appear and at least one replica to reach READY status. This typically takes 2-5 minutes for small models and 5-10 minutes for large models.

Once ready, extract the endpoint URL and run a health check:

# Get the endpoint URL
sky serve status {service-name}

The endpoint URL will be in the format http://IP:PORT. Test it:

# Health check
curl -s http://ENDPOINT_URL/health

# List models
curl -s http://ENDPOINT_URL/v1/models | python3 -m json.tool

# Test completion
curl -s http://ENDPOINT_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "{model-name}",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }' | python3 -m json.tool

Report the results to the user.

Step 7: Show Usage Information

After successful deployment, provide the user with everything they need to use the endpoint:

=== DEPLOYMENT COMPLETE ===

Endpoint: http://44.123.456.78:30001
Model:    meta-llama/Llama-3.1-8B-Instruct
API:      OpenAI-compatible (v1/chat/completions, v1/completions)

USAGE:

  Python (OpenAI SDK):
    from openai import OpenAI
    client = OpenAI(base_url="http://44.123.456.78:30001/v1", api_key="dummy")
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )

  curl:
    curl http://44.123.456.78:30001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
           "messages": [{"role": "user", "content": "Hello!"}]}'

MANAGEMENT:
  Status:    sky serve status {service-name}
  Logs:      sky serve logs {service-name}
  Update:    sky serve update {service-name} new-serve.yaml
  Scale:     Edit replica_policy in YAML and run update
  Tear down: sky serve down {service-name}

Step 8: Zero-Downtime Updates

If the user wants to update the model (new checkpoint, config change, etc.), guide them through a rolling update:

  1. Modify the serving YAML (new model path, new vLLM args, etc.)
  2. Run: sky serve update {service-name} new-serve.yaml
  3. SkyServe provisions new replicas with the updated config
  4. Once new replicas pass the readiness probe, traffic shifts to them
  5. Old replicas are torn down
sky serve update {service-name} serve-v2.yaml

No downtime. The old replicas continue serving until the new ones are ready.

Security Warning

SkyPilot-exposed ports are PUBLIC by default. There is no built-in authentication. If this endpoint serves sensitive data or is on the public internet:

  1. Add an API key middleware in front of vLLM
  2. Use a reverse proxy (nginx, Caddy) with authentication
  3. Restrict access via cloud security groups (manually, outside SkyPilot)
  4. For internal-only use, consider SSH tunneling: ssh -L 8000:localhost:8000 CLUSTER_IP

Flag this warning to the user after every deployment.

Reference

For SkyServe details, YAML spec, and CLI reference, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.

Install via CLI
npx skills add https://github.com/slapglif/skymcp --skill sky-serve
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator