sky-serve - SKILL.md Agent Skill

name: sky-serve
description: Deploy a trained model for inference using vLLM on SkyPilot SkyServe with autoscaling.
argument-hint: "[model-path] [gpu] -- e.g., 's3://my-model H100:1'"
allowed-tools: ["Read", "Write", "Bash"]

Sky Serve -- Model Inference Deployment with SkyServe

You are a deployment specialist that configures and launches model serving endpoints using vLLM on SkyPilot SkyServe. You handle model source configuration, GPU selection, autoscaling policy, YAML generation, deployment, health verification, and zero-downtime updates.

Step 1: Determine Model Source

If the user provided a model path in their argument, use it. Otherwise, ask where the model is.

Support these model sources:

| Source | Example | How to Mount/Load |
|---|---|---|
| HuggingFace Hub | `meta-llama/Llama-3.1-8B-Instruct` | Direct in vLLM `--model` arg |
| S3 bucket | `s3://my-bucket/models/my-finetune/` | `file_mounts` with `COPY` |
| GCS bucket | `gs://my-bucket/models/my-finetune/` | `file_mounts` with `COPY` |
| Local path | `/home/user/models/my-finetune/` | Upload via `file_mounts` |
| SkyPilot storage | `sky://my-storage/models/` | `file_mounts` with `MOUNT` |

For HuggingFace Hub models, the model ID is passed directly to vLLM. No file_mounts needed (vLLM downloads the model in the run command). This is the simplest path.

For custom checkpoints on cloud storage, use file_mounts with COPY mode to download the model at provision time.

Verify the model path format and check if authentication is needed:

HuggingFace gated models (Llama, etc.) require HF_TOKEN
Private S3/GCS buckets require configured cloud credentials
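
The source table above can be turned into a small lookup. The following is a minimal sketch, not part of SkyPilot: `classify_model_source` is a hypothetical helper, and the rule that any one- or two-segment name is a HuggingFace Hub ID is an assumption (a bare relative directory name would be misclassified).

```python
def classify_model_source(path: str) -> tuple[str, str]:
    """Return (source_kind, load_strategy) for a model path.

    Hypothetical helper; the strategies mirror the table in this step.
    """
    if path.startswith("s3://"):
        return ("s3", "file_mounts with mode: COPY")
    if path.startswith("gs://"):
        return ("gcs", "file_mounts with mode: COPY")
    if path.startswith("sky://"):
        return ("skypilot-storage", "file_mounts with mode: MOUNT")
    if path.startswith(("/", "~", "./", "../")):
        return ("local", "upload via file_mounts")
    # Assumption: "name" or "org/name" with no empty segment is a Hub model ID.
    parts = path.split("/")
    if 1 <= len(parts) <= 2 and all(parts) and " " not in path:
        return ("huggingface", "pass directly to vLLM --model")
    raise ValueError(f"Unrecognized model source: {path!r}")


if __name__ == "__main__":
    for example in ("meta-llama/Llama-3.1-8B-Instruct",
                    "s3://my-bucket/models/my-finetune/"):
        print(example, "->", classify_model_source(example))
```

Only the `huggingface` result skips `file_mounts`; every other kind needs a mount entry in the YAML generated in Step 4.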

Step 2: Determine GPU Requirements

If the user specified a GPU type, validate it against the model size. Otherwise, recommend based on model parameters:

| Model Size | Minimum GPU | Recommended GPU | vLLM Config |
|---|---|---|---|
| <= 3B | T4:1 or A10G:1 | A10G:1 | Default settings |
| 7-8B | A10G:1 or L4:1 | A100:1 | Default settings |
| 13B | A100:1 | A100-80GB:1 | `dtype=auto` |
| 30-34B | A100:2 | A100:2 | `tensor_parallel_size=2` |
| 70B | A100:4 or H100:2 | H100:4 | `tensor_parallel_size=4` |
| 405B | H100:8 | H100:8 x 2 nodes | Pipeline + tensor parallel |
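
A rough way to sanity-check a row of this table is to compare the size of the model weights with the usable GPU memory. The sketch below is an approximation under stated assumptions: 2 bytes per parameter (FP16/BF16), 90% usable memory per GPU, no allowance for KV cache or activations, and a GPU count rounded up to a power of two for tensor parallelism. `min_gpus_for_weights` is a hypothetical helper, not a vLLM or SkyPilot function.

```python
import math


def min_gpus_for_weights(params_b: float, gpu_mem_gb: float,
                         bytes_per_param: float = 2.0,
                         utilization: float = 0.90) -> int:
    """Estimate the fewest GPUs whose combined memory holds the weights.

    params_b is the parameter count in billions, so params_b * bytes_per_param
    is the weight size in GB. KV cache and activations are ignored, so treat
    the result as a lower bound.
    """
    weights_gb = params_b * bytes_per_param
    usable_gb_per_gpu = gpu_mem_gb * utilization
    needed = math.ceil(weights_gb / usable_gb_per_gpu)
    # Assumption: round up to the next power of two for tensor parallelism.
    return 1 << (needed - 1).bit_length()


if __name__ == "__main__":
    print(min_gpus_for_weights(70, 80))   # 70B on 80 GB GPUs
    print(min_gpus_for_weights(70, 40))   # 70B on 40 GB GPUs
```

Under these assumptions a 70B model needs at least two 80 GB GPUs or four 40 GB GPUs, which lines up with the minimum column. The recommended column leaves extra headroom for the KV cache.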

Check current GPU pricing:

sky show-gpus GPU_TYPE:COUNT

Present spot vs on-demand pricing. For serving endpoints (long-running), on-demand is usually preferred for reliability. However, SkyServe can handle spot preemption through its replica management -- if one replica goes down, traffic routes to healthy ones while a replacement provisions.

Step 3: Configure Autoscaling Policy

Discuss autoscaling with the user. Present the options:

Fixed Replicas (simplest)

service:
  replicas: 2

Fixed number of replicas. Good for predictable load.

QPS-Based Autoscaling (recommended)

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 4
    target_qps_per_replica: 5.0

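SkyServe adds replicas when the average QPS per replica exceeds the target and removes them when load drops. The right target_qps_per_replica depends on model size, GPU, and request length. As rough starting points: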

7B model on A100: 5-10 QPS per replica
70B model on H100x4: 1-3 QPS per replica
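
The steady-state arithmetic behind QPS-based scaling can be sketched as follows. This is an illustration of the idea only: `target_replicas` is a hypothetical function, and SkyServe's actual autoscaler also applies the upscale and downscale delays before acting.

```python
import math


def target_replicas(observed_qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Replica count needed to keep each replica at or below its QPS target."""
    wanted = math.ceil(observed_qps / target_qps_per_replica)
    # Clamp to the bounds configured in replica_policy.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 12 QPS at 5 QPS per replica needs 3 replicas.
    print(target_replicas(12, 5.0, min_replicas=1, max_replicas=4))
```

Measuring the sustainable QPS of one replica under a realistic prompt mix is the most reliable way to pick the target.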

Scale to Zero (cost-optimized)

service:
  replica_policy:
    min_replicas: 0
    max_replicas: 4
    target_qps_per_replica: 3.0
    upscale_delay_seconds: 30
    downscale_delay_seconds: 300

Scales to zero when idle. Cold start takes 2-5 minutes depending on model size. Good for development or intermittent use.

Ask the user about expected traffic patterns:

Low/variable traffic: min_replicas=1, max_replicas=4 with QPS-based scaling
Production traffic: min_replicas=2, max_replicas=8 with QPS-based scaling
Development/testing: min_replicas=0 (scale to zero)
Constant high traffic: fixed replicas sized for peak expected QPS
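
The mapping from traffic pattern to the `service` section could be captured as below. This is a sketch under assumptions: the profile names are hypothetical labels, the development profile borrows max_replicas=4 from the scale-to-zero example, and the fixed replica count is the peak QPS divided by the per-replica target, rounded up.

```python
import math


def replica_policy(profile: str, peak_qps: float = 0.0,
                   target_qps_per_replica: float = 5.0) -> dict:
    """Return the replica settings for a traffic profile as a plain dict."""
    bounds = {
        "low": (1, 4),         # low or variable traffic
        "production": (2, 8),  # production traffic
        "dev": (0, 4),         # development/testing, scale to zero
    }
    if profile in bounds:
        min_r, max_r = bounds[profile]
        return {"replica_policy": {
            "min_replicas": min_r,
            "max_replicas": max_r,
            "target_qps_per_replica": target_qps_per_replica,
        }}
    if profile == "constant":
        # Fixed replicas sized for the peak; always keep at least one.
        return {"replicas": max(1, math.ceil(peak_qps / target_qps_per_replica))}
    raise ValueError(f"Unknown traffic profile: {profile!r}")


if __name__ == "__main__":
    print(replica_policy("production"))
    print(replica_policy("constant", peak_qps=12))
```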

Step 4: Generate SkyServe YAML

Generate the complete serving YAML:

name: serve-{model-name}

resources:
  accelerators: {GPU_TYPE}:{COUNT}
  ports:
    - 8000
  use_spot: false  # On-demand recommended for serving
  disk_size: 256
  disk_tier: medium

service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 180
    timeout_seconds: 10
    post_data: null
  replica_policy:
    min_replicas: {min}
    max_replicas: {max}
    target_qps_per_replica: {target_qps}
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300

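envs:
  HF_TOKEN: null  # Supply at launch with --env HF_TOKEN (needed for gated models)
  MODEL_PATH: {model_path_or_hf_id}  # HuggingFace model ID, or /model when using file_mounts

# Include file_mounts only if the model is on cloud storage, not HuggingFace Hub
file_mounts:
  /model:
    source: {checkpoint_source}
    mode: COPY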

setup: |
  pip install vllm

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size ${SKYPILOT_NUM_GPUS_PER_NODE} \
    --dtype auto \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching

Key configuration decisions:

readiness_probe: Set initial_delay_seconds based on model size:

Small models (< 7B): 60-120 seconds
Medium models (7-13B): 120-180 seconds
Large models (30B to under 70B): 180-300 seconds
Very large models (70B+): 300-600 seconds

The probe hits /health which vLLM exposes by default. SkyServe will not route traffic until the probe succeeds.
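
The delay guidance above can be encoded as a lookup. In this sketch, `initial_delay_range` is a hypothetical helper, and treating models between 13B and 30B as "large" is an assumption, since the list does not cover that range.

```python
def initial_delay_range(params_b: float) -> tuple[int, int]:
    """Return the suggested (low, high) initial_delay_seconds for a model size.

    params_b is the parameter count in billions.
    """
    if params_b < 7:
        return (60, 120)    # small
    if params_b <= 13:
        return (120, 180)   # medium
    if params_b < 70:
        return (180, 300)   # large (assumption: also covers 13B-30B)
    return (300, 600)       # very large


if __name__ == "__main__":
    low, high = initial_delay_range(8)
    print(f"initial_delay_seconds: {high}  # suggested range {low}-{high}")
```

Choosing the upper end of the range is the safer default, because a probe that starts too early can mark a replica that is still loading as failed.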

vLLM arguments:

--tensor-parallel-size: Match to SKYPILOT_NUM_GPUS_PER_NODE for automatic multi-GPU
--dtype auto: Let vLLM pick the precision from the model config (FP16 for FP32/FP16 checkpoints, BF16 for BF16 checkpoints)
--max-model-len: Set based on expected input/output length. Lower values save VRAM. Default to 4096 for most use cases; increase for long-context models.
--gpu-memory-utilization 0.90: Let vLLM use up to 90% of each GPU's VRAM for weights, activations, and KV cache. Lower if OOM errors occur.
--enable-prefix-caching: Enable automatic prefix caching for repeated prompts (system prompts, few-shot examples)

Additional vLLM options to consider:

--quantization awq or --quantization gptq: For quantized models
--chat-template: For custom chat templates
--served-model-name: Custom model name in the API response
--max-num-seqs: Maximum concurrent sequences (controls throughput vs latency)
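
One way to keep the required and optional flags consistent is to assemble the command programmatically. This is a minimal sketch: `build_vllm_command` is a hypothetical helper, and it emits only flags that are named in this step.

```python
import shlex
from typing import Optional


def build_vllm_command(model: str,
                       tensor_parallel_size: int = 1,
                       max_model_len: int = 4096,
                       gpu_memory_utilization: float = 0.90,
                       quantization: Optional[str] = None,
                       served_model_name: Optional[str] = None,
                       max_num_seqs: Optional[int] = None) -> str:
    """Build the vLLM OpenAI-compatible server command as a shell string."""
    args = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--host", "0.0.0.0",
        "--port", "8000",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--dtype", "auto",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-prefix-caching",
    ]
    # Optional flags are appended only when a value is given.
    if quantization is not None:
        args += ["--quantization", quantization]
    if served_model_name is not None:
        args += ["--served-model-name", served_model_name]
    if max_num_seqs is not None:
        args += ["--max-num-seqs", str(max_num_seqs)]
    # shlex.join quotes any argument that needs it.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_vllm_command("meta-llama/Llama-3.1-8B-Instruct",
                             quantization="awq"))
```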

Write the YAML to the current directory.
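
The part of the YAML that changes with the model source is the `envs` and `file_mounts` pair. The sketch below shows one way to render it. It assumes that only `s3://` and `gs://` sources are mounted at `/model` in COPY mode and that everything else is passed through as a HuggingFace Hub ID. `render_model_sections` is a hypothetical helper.

```python
def render_model_sections(model: str, mount_point: str = "/model") -> str:
    """Render the envs section, plus file_mounts when the model is in a bucket."""
    is_bucket = model.startswith(("s3://", "gs://"))
    # With a mounted checkpoint, vLLM loads from the mount point instead.
    model_path = mount_point if is_bucket else model
    lines = [
        "envs:",
        "  HF_TOKEN: null",
        f"  MODEL_PATH: {model_path}",
    ]
    if is_bucket:
        lines += [
            "",
            "file_mounts:",
            f"  {mount_point}:",
            f"    source: {model}",
            "    mode: COPY",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_model_sections("s3://my-bucket/models/my-finetune/"))
```

For a Hub model the `file_mounts` section is omitted entirely, which matches the comment in the template.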

Step 5: Deploy the Service

Present the deployment plan:

DEPLOYMENT PLAN:
  Model:       meta-llama/Llama-3.1-8B-Instruct
  GPU:         A100:1 (on-demand @ $3.20/hr)
  Replicas:    1-4 (QPS-based autoscaling)
  Target QPS:  5.0 per replica
  Port:        8000 (OpenAI-compatible API)
  Est. cost:   $3.20/hr per replica

  The endpoint will be available ~3 minutes after launch.
  Proceed?
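
Because the plan quotes cost per replica, it can help to show the range the autoscaler may produce. The following sketch computes the minimum and maximum spend over a period. The $3.20/hr figure is the illustrative price from the plan above, not a quoted rate, and `cost_range` is a hypothetical helper.

```python
def cost_range(hourly_rate: float, min_replicas: int, max_replicas: int,
               hours: float = 24 * 30) -> tuple[float, float]:
    """Return (lowest, highest) cost over `hours` for the replica bounds."""
    low = round(hourly_rate * min_replicas * hours, 2)
    high = round(hourly_rate * max_replicas * hours, 2)
    return (low, high)


if __name__ == "__main__":
    low, high = cost_range(3.20, min_replicas=1, max_replicas=4)
    print(f"Est. 30-day cost: ${low:,.2f} to ${high:,.2f}")
```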

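After confirmation, launch the service. Because the YAML declares HF_TOKEN: null, the value must be supplied at launch; --env HF_TOKEN forwards it from the local environment:

sky serve up serve.yaml -n {service-name} --env HF_TOKEN -y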

Step 6: Verify Deployment

Wait for the service to become ready. Check status:

sky serve status {service-name}

Watch for the endpoint URL to appear and at least one replica to reach READY status. This typically takes 2-5 minutes for small models and 5-10 minutes for large models.

Once ready, extract the endpoint URL and run a health check:

# Print only the endpoint
sky serve status {service-name} --endpoint

The endpoint is printed as IP:PORT, without a scheme. Substitute it for ENDPOINT_URL and test it:

# Health check
curl -s http://ENDPOINT_URL/health

# List models
curl -s http://ENDPOINT_URL/v1/models | python3 -m json.tool

# Test completion
curl -s http://ENDPOINT_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "{model-name}",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }' | python3 -m json.tool

Report the results to the user.
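
The wait-then-check sequence above can also be scripted. This is a minimal sketch using only the standard library. It assumes the endpoint is given as `host:port` without a scheme; the 600-second deadline and 10-second interval are arbitrary defaults. `is_healthy` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def is_healthy(endpoint: str, timeout: float = 10.0) -> bool:
    """Return True if GET http://<endpoint>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://{endpoint}/health",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_until_healthy(endpoint: str, deadline_s: float = 600.0,
                       interval_s: float = 10.0, probe=is_healthy,
                       sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll the health endpoint until it succeeds or the deadline passes."""
    start = clock()
    while True:
        if probe(endpoint):
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Replace with the IP:PORT reported by sky serve status.
    print(wait_until_healthy("127.0.0.1:8000", deadline_s=30))
```

The `probe`, `sleep`, and `clock` parameters exist so the loop can be exercised without a live endpoint.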

Step 7: Show Usage Information

After successful deployment, provide the user with everything they need to use the endpoint:

=== DEPLOYMENT COMPLETE ===

Endpoint: http://203.0.113.10:30001
Model:    meta-llama/Llama-3.1-8B-Instruct
API:      OpenAI-compatible (/v1/chat/completions, /v1/completions)

USAGE:

  Python (OpenAI SDK):
    from openai import OpenAI
    # vLLM ignores the key unless the server was started with --api-key
    client = OpenAI(base_url="http://203.0.113.10:30001/v1", api_key="dummy")
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0].message.content)

  curl:
    curl http://203.0.113.10:30001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
           "messages": [{"role": "user", "content": "Hello!"}]}'

MANAGEMENT:
  Status:    sky serve status {service-name}
  Logs:      sky serve logs {service-name} {replica-id}
  Update:    sky serve update {service-name} new-serve.yaml
  Scale:     Edit replica_policy in YAML and run update
  Tear down: sky serve down {service-name}

Step 8: Zero-Downtime Updates

If the user wants to update the model (new checkpoint, config change, etc.), guide them through a rolling update:

Modify the serving YAML (new model path, new vLLM args, etc.)
Run: sky serve update {service-name} new-serve.yaml
SkyServe provisions new replicas with the updated config
Once new replicas pass the readiness probe, traffic shifts to them
Old replicas are torn down

sky serve update {service-name} new-serve.yaml

No downtime. The old replicas continue serving until the new ones are ready.
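
The hand-over rule in the steps above can be expressed as a toy model. This sketch only mirrors the description in this section (old replicas serve until a new one is ready, then traffic shifts and the old ones are removed). It is not SkyServe's implementation, and both function names are hypothetical.

```python
def traffic_targets(old_ready: list[str], new_ready: list[str]) -> list[str]:
    """Replicas that should receive traffic at this point of the update."""
    # Keep serving from the old version until a new replica passes its probe.
    return list(new_ready) if new_ready else list(old_ready)


def replicas_to_tear_down(old: list[str], new_ready: list[str]) -> list[str]:
    """Old replicas that can be removed once new ones are serving."""
    return list(old) if new_ready else []


if __name__ == "__main__":
    old = ["v1-a", "v1-b"]
    print(traffic_targets(old, []))          # still provisioning
    print(traffic_targets(old, ["v2-a"]))    # new replica is ready
    print(replicas_to_tear_down(old, ["v2-a"]))
```

At every step at least one ready replica is a traffic target, which is where the no-downtime property comes from.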

Security Warning

SkyPilot-exposed ports are PUBLIC by default. There is no built-in authentication. If this endpoint serves sensitive data or is on the public internet:

Require an API key, either with vLLM's --api-key option or with middleware in front of vLLM
Use a reverse proxy (nginx, Caddy) with authentication
Restrict access via cloud security groups (manually, outside SkyPilot)
For internal-only use, consider SSH tunneling: ssh -L 8000:localhost:8000 CLUSTER_IP

Flag this warning to the user after every deployment.
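
For the API key option, the core check that such middleware performs is small. The sketch below assumes the client sends the key as `Authorization: Bearer <key>`, which is how the OpenAI SDK transmits `api_key`. `is_authorized` is a hypothetical helper, and a real deployment would wire it into the proxy or server framework in use.

```python
import hmac


def is_authorized(headers: dict, expected_key: str) -> bool:
    """Check a Bearer token against the expected API key."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison avoids leaking the key through timing.
    return hmac.compare_digest(token.encode(), expected_key.encode())


if __name__ == "__main__":
    print(is_authorized({"Authorization": "Bearer s3cret"}, "s3cret"))
    print(is_authorized({}, "s3cret"))
```

The key itself should come from an environment variable or secret store rather than being written into the serving YAML.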

Reference

For SkyServe details, YAML spec, and CLI reference, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.