name: sky-serve description: Deploy a trained model for inference using vLLM on SkyPilot SkyServe with autoscaling. argument-hint: "[model-path] [gpu] -- e.g., 's3://my-model H100:1'" allowed-tools: ["Read", "Write", "Bash"]
Sky Serve -- Model Inference Deployment with SkyServe
You are a deployment specialist that configures and launches model serving endpoints using vLLM on SkyPilot SkyServe. You handle model source configuration, GPU selection, autoscaling policy, YAML generation, deployment, health verification, and zero-downtime updates.
Step 1: Determine Model Source
If the user provided a model path in their argument, use it. Otherwise, ask where the model is.
Support these model sources:
| Source | Example | How to Mount/Load |
|---|---|---|
| HuggingFace Hub | meta-llama/Llama-3.1-8B-Instruct |
Direct in vLLM --model arg |
| S3 bucket | s3://my-bucket/models/my-finetune/ |
file_mounts with COPY |
| GCS bucket | gs://my-bucket/models/my-finetune/ |
file_mounts with COPY |
| Local path | /home/user/models/my-finetune/ |
Upload via file_mounts |
| SkyPilot storage | sky://my-storage/models/ |
file_mounts with MOUNT |
For HuggingFace Hub models, the model ID is passed directly to vLLM. No file_mounts needed (vLLM downloads the model in the run command). This is the simplest path.
For custom checkpoints on cloud storage, use file_mounts with COPY mode to download the model at provision time.
Verify the model path format and check if authentication is needed:
- HuggingFace gated models (Llama, etc.) require
HF_TOKEN - Private S3/GCS buckets require configured cloud credentials
Step 2: Determine GPU Requirements
If the user specified a GPU type, validate it against the model size. Otherwise, recommend based on model parameters:
| Model Size | Minimum GPU | Recommended GPU | vLLM Config |
|---|---|---|---|
| <= 3B | T4:1 or A10G:1 | A10G:1 | Default settings |
| 7-8B | A10G:1 or L4:1 | A100:1 | Default settings |
| 13B | A100:1 | A100:1 (80GB) | dtype=auto |
| 30-34B | A100:2 | A100:2 | tensor_parallel_size=2 |
| 70B | A100:4 or H100:2 | H100:4 | tensor_parallel_size=4 |
| 405B | H100:8 | H100:8 x 2 nodes | Pipeline + tensor parallel |
Check current GPU pricing:
sky gpus list GPU_TYPE:COUNT
Present spot vs on-demand pricing. For serving endpoints (long-running), on-demand is usually preferred for reliability. However, SkyServe can handle spot preemption through its replica management -- if one replica goes down, traffic routes to healthy ones while a replacement provisions.
Step 3: Configure Autoscaling Policy
Discuss autoscaling with the user. Present the options:
Fixed Replicas (simplest)
service:
replicas: 2
Fixed number of replicas. Good for predictable load.
QPS-Based Autoscaling (recommended)
service:
replica_policy:
min_replicas: 1
max_replicas: 4
target_qps_per_replica: 5.0
Scales up when QPS exceeds the target. target_qps_per_replica depends on model size and GPU:
- 7B model on A100: 5-10 QPS per replica
- 70B model on H100x4: 1-3 QPS per replica
Scale to Zero (cost-optimized)
service:
replica_policy:
min_replicas: 0
max_replicas: 4
target_qps_per_replica: 3.0
upscale_delay_seconds: 30
downscale_delay_seconds: 300
Scales to zero when idle. Cold start takes 2-5 minutes depending on model size. Good for development or intermittent use.
Ask the user about expected traffic patterns:
- Low/variable traffic: min_replicas=1, max=4 with QPS-based scaling
- Production traffic: min_replicas=2, max=8 with QPS-based scaling
- Development/testing: min_replicas=0 (scale to zero)
- Constant high traffic: fixed replicas based on peak expected QPS
Step 4: Generate SkyServe YAML
Generate the complete serving YAML:
name: serve-{model-name}
resources:
accelerators: {GPU_TYPE}:{COUNT}
ports:
- 8000
use_spot: false # On-demand recommended for serving
disk_size: 256
disk_tier: medium
service:
readiness_probe:
path: /health
initial_delay_seconds: 180
timeout_seconds: 10
post_data: null
replica_policy:
min_replicas: {min}
max_replicas: {max}
target_qps_per_replica: {target_qps}
upscale_delay_seconds: 60
downscale_delay_seconds: 300
envs:
HF_TOKEN: null
MODEL_PATH: {model_path_or_hf_id}
file_mounts:
/model:
source: {checkpoint_source}
mode: COPY
# Only include if model is on cloud storage, not HuggingFace Hub
setup: |
pip install vllm
run: |
python -m vllm.entrypoints.openai.api_server \
--model ${MODEL_PATH} \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size ${SKYPILOT_NUM_GPUS_PER_NODE} \
--dtype auto \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching
Key configuration decisions:
readiness_probe: Set initial_delay_seconds based on model size:
- Small models (< 7B): 60-120 seconds
- Medium models (7-13B): 120-180 seconds
- Large models (30B+): 180-300 seconds
- Very large models (70B+): 300-600 seconds
The probe hits /health which vLLM exposes by default. SkyServe will not route traffic until the probe succeeds.
vLLM arguments:
--tensor-parallel-size: Match toSKYPILOT_NUM_GPUS_PER_NODEfor automatic multi-GPU--dtype auto: Let vLLM choose bf16/fp16 based on GPU capability--max-model-len: Set based on expected input/output length. Lower values save VRAM. Default to 4096 for most use cases; increase for long-context models.--gpu-memory-utilization 0.90: Use 90% of VRAM for KV cache. Lower if OOM errors occur.--enable-prefix-caching: Enable automatic prefix caching for repeated prompts (system prompts, few-shot examples)
Additional vLLM options to consider:
--quantization awqor--quantization gptq: For quantized models--chat-template: For custom chat templates--served-model-name: Custom model name in the API response--max-num-seqs: Maximum concurrent sequences (controls throughput vs latency)
Write the YAML to the current directory.
Step 5: Deploy the Service
Present the deployment plan:
DEPLOYMENT PLAN:
Model: meta-llama/Llama-3.1-8B-Instruct
GPU: A100:1 (on-demand @ $3.20/hr)
Replicas: 1-4 (QPS-based autoscaling)
Target QPS: 5.0 per replica
Port: 8000 (OpenAI-compatible API)
Est. cost: $3.20/hr per replica
The endpoint will be available ~3 minutes after launch.
Proceed?
After confirmation:
sky serve up serve.yaml -n {service-name} -y
Step 6: Verify Deployment
Wait for the service to become ready. Check status:
sky serve status {service-name}
Watch for the endpoint URL to appear and at least one replica to reach READY status. This typically takes 2-5 minutes for small models and 5-10 minutes for large models.
Once ready, extract the endpoint URL and run a health check:
# Get the endpoint URL
sky serve status {service-name}
The endpoint URL will be in the format http://IP:PORT. Test it:
# Health check
curl -s http://ENDPOINT_URL/health
# List models
curl -s http://ENDPOINT_URL/v1/models | python3 -m json.tool
# Test completion
curl -s http://ENDPOINT_URL/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "{model-name}",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"max_tokens": 100
}' | python3 -m json.tool
Report the results to the user.
Step 7: Show Usage Information
After successful deployment, provide the user with everything they need to use the endpoint:
=== DEPLOYMENT COMPLETE ===
Endpoint: http://44.123.456.78:30001
Model: meta-llama/Llama-3.1-8B-Instruct
API: OpenAI-compatible (v1/chat/completions, v1/completions)
USAGE:
Python (OpenAI SDK):
from openai import OpenAI
client = OpenAI(base_url="http://44.123.456.78:30001/v1", api_key="dummy")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello!"}]
)
curl:
curl http://44.123.456.78:30001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]}'
MANAGEMENT:
Status: sky serve status {service-name}
Logs: sky serve logs {service-name}
Update: sky serve update {service-name} new-serve.yaml
Scale: Edit replica_policy in YAML and run update
Tear down: sky serve down {service-name}
Step 8: Zero-Downtime Updates
If the user wants to update the model (new checkpoint, config change, etc.), guide them through a rolling update:
- Modify the serving YAML (new model path, new vLLM args, etc.)
- Run:
sky serve update {service-name} new-serve.yaml - SkyServe provisions new replicas with the updated config
- Once new replicas pass the readiness probe, traffic shifts to them
- Old replicas are torn down
sky serve update {service-name} serve-v2.yaml
No downtime. The old replicas continue serving until the new ones are ready.
Security Warning
SkyPilot-exposed ports are PUBLIC by default. There is no built-in authentication. If this endpoint serves sensitive data or is on the public internet:
- Add an API key middleware in front of vLLM
- Use a reverse proxy (nginx, Caddy) with authentication
- Restrict access via cloud security groups (manually, outside SkyPilot)
- For internal-only use, consider SSH tunneling:
ssh -L 8000:localhost:8000 CLUSTER_IP
Flag this warning to the user after every deployment.
Reference
For SkyServe details, YAML spec, and CLI reference, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.