cost-optimization - SKILL.md Agent Skill

name: cost-optimization
description: Use when reducing cloud GPU costs, choosing spot vs on-demand, estimating training expenses, managing budgets, configuring auto-stop, or optimizing spend across clouds - covers SkyPilot optimizer, spot failover, pricing tiers, and budget management

Cloud Cost Optimization

Overview

Minimize cloud GPU spend without sacrificing reliability. SkyPilot's optimizer automatically finds the cheapest available instance across all configured clouds, and its spot/failover system provides 3-6x savings with automatic recovery.

Core principle: Never pay on-demand prices for interruptible workloads. Use spot instances with checkpointing and failover, and auto-stop everything.

When to Use

Launching any GPU job (training, eval, inference)
Estimating cost before committing to a training run
Budget is constrained and you need to reduce spend
Choosing between spot and on-demand
Preventing idle cluster waste
Comparing GPU pricing across clouds

Do not use for:

Latency-critical live serving (apply this skill to training, eval, and batch inference instead)
Jobs under 5 minutes (spot preemption overhead not worth it)

SkyPilot Optimizer

SkyPilot automatically ranks all (cloud, region, instance) triples by hourly price and selects the cheapest available option.

# Let SkyPilot pick the cheapest cloud/region
resources:
  accelerators: A100:1
  # No cloud specified = search all configured clouds

# See what SkyPilot would pick and the price
sky launch --dryrun job.yaml

# List pricing for a specific GPU across all clouds
sky show-gpus A100:8

# Check actual spend
sky cost-report
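The ranking idea can be sketched in a few lines of Python. This is an illustration only, not SkyPilot's actual optimizer: the `Offer` type, the `cheapest_offer` helper, the instance names, and the prices are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One (cloud, region, instance) candidate with its hourly price."""
    cloud: str
    region: str
    instance: str
    hourly_usd: float
    available: bool = True


def cheapest_offer(offers):
    """Return the lowest-priced available offer, or None if none is available."""
    candidates = [o for o in offers if o.available]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


# Made-up offers and prices, for illustration only.
offers = [
    Offer("aws", "us-east-1", "instance-a", 4.10),
    Offer("gcp", "us-central1", "instance-b", 3.67),
    Offer("lambda", "us-west-1", "instance-c", 1.29, available=False),
]

best = cheapest_offer(offers)
print(best.cloud, best.hourly_usd)
```

The cheapest offer is skipped here because it is marked unavailable, which is why a multi-cloud search can pick a different cloud from one hour to the next.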

Spot Instances: 3-6x Savings

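Spot instances are preemptible but dramatically cheaper. SkyPilot recovers from preemptions automatically when the job runs as a managed job (sky jobs launch); a cluster started with plain sky launch is not relaunched after a preemption.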

resources:
  accelerators: H100:8
  use_spot: true

When to use spot:

Training with checkpointing (can resume after preemption)
Evaluation jobs (stateless, can restart)
Hyperparameter sweeps (many short jobs)
Any job that writes progress to persistent storage

When NOT to use spot:

Live serving endpoints (use on-demand or reserved)
Jobs under 5 minutes (recovery overhead)
Demos or time-critical deadlines

Mixed Failover Pattern

The most cost-effective pattern: try spot across clouds, fall back to on-demand only if all spot is unavailable.

resources:
  any_of:
    # Try spot first. Lambda Cloud is omitted here because it offers no spot tier.
    - accelerators: H100:8
      use_spot: true
      cloud: aws
    - accelerators: H100:8
      use_spot: true
      cloud: gcp
    # On-demand fallback (guaranteed availability)
    - accelerators: H100:8
      use_spot: false
      cloud: lambda

Multi-GPU failover (different GPU types):

resources:
  any_of:
    - accelerators: H100:8
      use_spot: true
    - accelerators: A100-80GB:8
      use_spot: true
    - accelerators: A100:8
      use_spot: true
    - accelerators: H100:8
      use_spot: false     # on-demand fallback
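The fallback logic behind these configs can be pictured as a loop over candidates. The sketch below is a simplified model, not SkyPilot's implementation: `launch_with_failover` and `fake_provision` are hypothetical names, and the loop walks the list in order, whereas SkyPilot's optimizer chooses among `any_of` entries by price.

```python
def launch_with_failover(candidates, try_provision):
    """Return the first candidate that provisions successfully, else None.

    `try_provision` is a caller-supplied callable (hypothetical here) that
    returns True when capacity for the candidate was obtained.
    """
    for candidate in candidates:
        if try_provision(candidate):
            return candidate
    return None


# Mirrors the multi-GPU YAML above: spot options first, on-demand last.
candidates = [
    {"accelerators": "H100:8", "use_spot": True},
    {"accelerators": "A100-80GB:8", "use_spot": True},
    {"accelerators": "A100:8", "use_spot": True},
    {"accelerators": "H100:8", "use_spot": False},
]


def fake_provision(candidate):
    """Stand-in provisioner: pretend every spot pool is out of capacity."""
    return not candidate["use_spot"]


chosen = launch_with_failover(candidates, fake_provision)
print(chosen)
```

With all spot capacity exhausted, the loop lands on the on-demand entry, so the job still starts and only the price changes.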

Auto-Stop: Prevent Waste

Auto-stop is often the highest-impact cost optimization: an idle cluster bills at the full hourly rate while doing no work.

# Stop cluster after 30 minutes idle
sky autostop mycluster -i 30

# Tear down cluster after 30 minutes idle (no storage cost)
sky autostop mycluster -i 30 --down

# Set auto-stop at launch time
sky launch job.yaml --idle-minutes-to-autostop 30

# Set auto-stop with teardown at launch time
sky launch job.yaml --idle-minutes-to-autostop 30 --down

In YAML:

resources:
  accelerators: A100:1
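  # Older SkyPilot releases have no YAML field for auto-stop -- pass it via CLI flags.
  # Newer releases document a resources.autostop field; check the YAML spec for your version.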

Rules of thumb:

Training jobs: --idle-minutes-to-autostop 10 --down (tear down when done)
Development clusters: --idle-minutes-to-autostop 30 (stop but keep disk)
Batch eval: --idle-minutes-to-autostop 5 --down (tear down fast)
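To see why these settings matter, here is a rough sketch of what idle time costs with and without auto-stop. The `idle_cost` helper and the numbers are hypothetical. The model ignores disk charges on stopped clusters and assumes billing ends the moment auto-stop fires.

```python
def idle_cost(hourly_rate, idle_hours, autostop_minutes=None):
    """Cost of idle time in dollars.

    Without auto-stop the whole idle period is billed. With auto-stop,
    only the idle window before the stop (at most `autostop_minutes`) is billed.
    """
    if hourly_rate < 0 or idle_hours < 0:
        raise ValueError("hourly_rate and idle_hours must be non-negative")
    if autostop_minutes is None:
        billed_hours = idle_hours
    else:
        billed_hours = min(idle_hours, autostop_minutes / 60)
    return hourly_rate * billed_hours


# Example: a $2.50/hr cluster left idle overnight for 14 hours.
without_autostop = idle_cost(2.50, 14)
with_autostop = idle_cost(2.50, 14, autostop_minutes=30)
print(f"no auto-stop: ${without_autostop:.2f}, 30-minute auto-stop: ${with_autostop:.2f}")
```

In this example the overnight idle period costs $35.00 without auto-stop and $1.25 with a 30-minute idle limit.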

Cost Estimation

Always estimate before launching long runs.

# Dry run shows instance selection and hourly price
sky launch --dryrun job.yaml

# List all GPU options with pricing
sky show-gpus --all

# Filter by GPU type
sky show-gpus H100

# Estimate total cost for a training run
# Formula: hourly_rate * estimated_hours * (1 + overhead_buffer)
# Example: $2.50/hr * 24hr * 1.1 = $66

Quick cost calculator:

Total cost = hourly_rate * hours * 1.1 (10% buffer for setup/teardown)

Spot rate = on_demand_rate * 0.25 to 0.35 (spot is 65-75% cheaper)

Multi-node cost = per_node_rate * num_nodes * hours * 1.1
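The calculator formulas translate directly into a small helper. This is a minimal sketch; the function names are hypothetical and the default 10% buffer and 0.25-0.35 spot fractions are the rules of thumb stated above, not quoted prices.

```python
def estimate_cost(hourly_rate, hours, num_nodes=1, buffer=0.10):
    """Total cost = per-node hourly rate * nodes * hours * (1 + buffer)."""
    return hourly_rate * num_nodes * hours * (1 + buffer)


def spot_rate_range(on_demand_rate, low=0.25, high=0.35):
    """Rough (min, max) spot hourly rate derived from an on-demand rate."""
    return on_demand_rate * low, on_demand_rate * high


# Single node: $2.50/hr for 24 hours with the 10% buffer.
single = estimate_cost(2.50, 24)
print(f"single node: ${single:.2f}")  # $66.00

# Four nodes at the same per-node rate.
multi = estimate_cost(2.50, 24, num_nodes=4)
print(f"four nodes: ${multi:.2f}")  # $264.00

# Estimated spot rate range for an $8.00/hr on-demand instance.
low, high = spot_rate_range(8.00)
print(f"spot estimate: ${low:.2f}-${high:.2f}/hr")
```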

GPU Pricing Tiers (Approximate, 2026)

For detailed per-cloud pricing, see references/pricing-guide.md.

GPU	Spot ($/hr)	On-Demand ($/hr)	Best For
H100 80GB	$2.00-4.00	$8.00-12.00	Large model training, fastest throughput
A100 80GB	$1.00-2.00	$4.00-6.00	Standard training, good price/performance
A100 40GB	$0.80-1.50	$3.00-5.00	Models up to ~30B, inference
L40S 48GB	$0.80-1.50	$2.50-4.00	Training + inference hybrid
A10G 24GB	$0.40-0.80	$1.50-2.50	Fine-tuning, small models, eval
T4 16GB	$0.15-0.35	$0.50-1.00	Inference, small fine-tuning

Cloud-specific cost advantages:

Lambda Cloud: Often cheapest for H100/A100 on-demand
AWS: Best spot availability, largest region selection
GCP: Good A100 spot pricing, TPU option
RunPod/Vast.ai: Cheapest for short bursts (via SkyPilot)

Checkpointing Frequency vs Cost

More frequent checkpoints = less wasted work on preemption, but more I/O overhead.

Optimal checkpoint interval = sqrt(2 * mean_time_between_preemptions * checkpoint_cost)

Rules of thumb:

Spot on AWS/GCP: checkpoint every 15-30 minutes
Spot on clouds where preemption is rarer: checkpoint every 30-60 minutes
On-demand: checkpoint every 1-2 hours (insurance only)
Async checkpointing: free to checkpoint more often (no training stall)
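The interval formula above can be evaluated directly. In this sketch the function name is hypothetical and the inputs (a preemption roughly every 4 hours, a 2-minute checkpoint write) are illustrative assumptions. Both inputs must be in the same time unit, and the result comes back in that unit.

```python
import math


def optimal_checkpoint_interval(mean_time_between_preemptions, checkpoint_cost):
    """Interval = sqrt(2 * mean time between preemptions * checkpoint cost).

    Both arguments must use the same time unit (minutes here).
    """
    if mean_time_between_preemptions <= 0 or checkpoint_cost <= 0:
        raise ValueError("both inputs must be positive")
    return math.sqrt(2 * mean_time_between_preemptions * checkpoint_cost)


# Assumed: preempted about every 240 minutes, each checkpoint stalls training for 2 minutes.
interval = optimal_checkpoint_interval(240, 2)
print(f"checkpoint every ~{interval:.0f} minutes")  # ~31 minutes
```

The result falls close to the 15-30 minute rule of thumb. Cheaper checkpoints (for example async writes) shrink `checkpoint_cost` and therefore shorten the interval.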

Budget Management

Daily cost cap pattern:

# Check spend before launching
sky cost-report

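# SkyPilot has no built-in budget cap, so enforce one externally.
# NOTE: the --json flag and the {"clusters": [{"cost": ...}]} shape are assumptions;
# confirm them with `sky cost-report --help`. The total is cumulative spend across
# the reported clusters, not strictly today's spend.
DAILY_BUDGET=100
# Abort (fail closed) if the report cannot be fetched or parsed
CURRENT_SPEND=$(sky cost-report --json | python3 -c "import json,sys; print(sum(j['cost'] for j in json.load(sys.stdin)['clusters']))") || exit 1
if (( $(echo "$CURRENT_SPEND > $DAILY_BUDGET" | bc -l) )); then
  echo "BUDGET EXCEEDED: $CURRENT_SPEND > $DAILY_BUDGET" >&2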
  exit 1
fi
sky launch job.yaml
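The same gate can be written as a small Python function, which is easier to unit-test than the shell one-liner. This is a sketch under the same assumption as the script above: the report is JSON with a top-level `clusters` list whose entries carry a numeric `cost`. That shape is not confirmed SkyPilot output, and `total_spend` and `within_budget` are hypothetical helper names.

```python
import json


def total_spend(report_json):
    """Sum the `cost` field over all clusters in an assumed report format."""
    report = json.loads(report_json)
    return sum(float(cluster["cost"]) for cluster in report.get("clusters", []))


def within_budget(report_json, budget):
    """True when cumulative spend does not exceed the budget."""
    return total_spend(report_json) <= budget


# Hand-written sample in the assumed format, for illustration only.
sample = json.dumps({
    "clusters": [
        {"name": "train", "cost": 62.5},
        {"name": "eval", "cost": 12.25},
    ]
})

print(total_spend(sample))         # 74.75
print(within_budget(sample, 100))  # True
print(within_budget(sample, 50))   # False
```

A wrapper script can call `within_budget` and exit non-zero before `sky launch` whenever it returns False.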

Cost-aware experiment design:

Run fast sanity checks on cheap GPUs (T4/A10G) before full training on H100
Cap eval samples while debugging (e.g. --limit 100 if your eval harness supports it); run the full eval only on final checkpoints
Sweep hyperparameters with short runs (1000 steps) before full training
Use mixed failover to catch cheap spot when available

For advanced cost patterns (preemption-aware scheduling, multi-job orchestration, reserved instance strategy), see references/cost-patterns.md.

Common Mistakes

Mistake	Fix
Forgetting auto-stop	Always pass `--idle-minutes-to-autostop`. Set a shell alias.
On-demand for training	Use spot. With regular checkpointing, a preemption costs only the work done since the last checkpoint.
Not using `--dryrun`	Always check price before launching.
Single-cloud lock-in	Configure multiple clouds. SkyPilot finds cheapest.
No checkpointing with spot	Preemption = lost work. Always checkpoint.
Over-provisioning GPU	Use smallest GPU that fits your model. A10G for 7B fine-tune, not H100.
Forgetting `--down`	Stopped clusters still cost for disk. Use `--down` for batch jobs.

Quick Reference

# See cheapest option for your job
sky launch --dryrun job.yaml

# Launch on cheapest spot
sky launch job.yaml --use-spot --idle-minutes-to-autostop 10 --down

# List clusters and their status (costs are shown by sky cost-report)
sky status

# Cost report
sky cost-report

# Tear down all clusters
sky down --all

# GPU pricing
sky show-gpus H100