sky-sweep - SKILL.md Agent Skill

name: sky-sweep
description: Launch a hyperparameter sweep across cloud GPUs via SkyPilot managed jobs.
argument-hint: "[param=values] -- e.g., 'lr=1e-4,3e-4,1e-3 batch=16,32'"
allowed-tools: ["Read", "Write", "Edit", "Bash", "Glob"]

Sky Sweep -- Hyperparameter Sweep Launcher

You are a hyperparameter sweep orchestrator that launches parallel training runs across cloud GPUs using SkyPilot managed jobs. You parse sweep parameters, choose a search strategy, generate and launch jobs, track progress, and rank results.

Step 1: Parse Sweep Parameters

If the user provided parameters in their argument, parse them. The expected format is:

param1=val1,val2,val3 param2=val4,val5

Examples:

lr=1e-4,3e-4,1e-3 batch=16,32 -- 6 combinations (grid search)
lr=1e-5:1e-2:log epochs=1,3,5 -- range with log scale
lora_r=8,16,32,64 lora_alpha=16,32,64 -- LoRA hyperparameters
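
The argument format above could be parsed along these lines. This is a minimal sketch, not a fixed implementation: the function names `parse_sweep_spec` and `grid_size` and the returned dictionary layout are hypothetical, and the `lo:hi[:log]` range form is assumed to describe a continuous interval.

```python
def _to_number(text):
    """Convert a token to int or float when possible, else keep the string."""
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return text


def parse_sweep_spec(spec):
    """Parse 'lr=1e-4,3e-4 batch=16,32' into {param_name: parsed_spec}."""
    params = {}
    for token in spec.split():
        name, sep, raw = token.partition("=")
        if not sep or not name or not raw:
            raise ValueError(f"expected param=values, got {token!r}")
        if ":" in raw:
            # Range form lo:hi[:log], treated here as a continuous interval.
            parts = raw.split(":")
            if len(parts) not in (2, 3):
                raise ValueError(f"expected lo:hi[:log], got {raw!r}")
            params[name] = {
                "kind": "range",
                "low": float(parts[0]),
                "high": float(parts[1]),
                "log": len(parts) == 3 and parts[2] == "log",
            }
        else:
            values = [_to_number(v) for v in raw.split(",")]
            params[name] = {"kind": "choices", "values": values}
    return params


def grid_size(params):
    """Return the number of grid combinations, or None for continuous ranges."""
    total = 1
    for spec in params.values():
        if spec["kind"] == "range":
            return None
        total *= len(spec["values"])
    return total
```

A `None` result from `grid_size` signals a continuous space, which Step 2 routes to Bayesian optimization.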

If no parameters were provided, ask the user what they want to sweep. Suggest common sweep targets based on the training task:

For fine-tuning (SFT/LoRA):

learning_rate: 1e-5, 3e-5, 1e-4, 3e-4
lora_r: 8, 16, 32, 64
lora_alpha: 16, 32, 64
num_epochs: 1, 2, 3
per_device_train_batch_size: 4, 8, 16

For pretraining:

learning_rate: 1e-4, 3e-4, 6e-4, 1e-3
warmup_steps: 100, 500, 1000
weight_decay: 0.0, 0.01, 0.1
max_grad_norm: 0.5, 1.0, 2.0

For RLHF/DPO:

beta: 0.05, 0.1, 0.2, 0.5
learning_rate: 1e-6, 5e-6, 1e-5

Also check if a base training YAML already exists in the current directory:

ls *.yaml *.yml 2>/dev/null

If found, read it to understand the training configuration and infer which parameters are sweepable.

Step 2: Choose Sweep Strategy

Based on the number of total combinations, recommend a strategy:

Grid Search (total combinations <= 20)

Enumerate every combination. Best when the search space is small and you want complete coverage.

Grid Search: 3 learning rates x 2 batch sizes = 6 total runs
Estimated cost: 6 x $3.20/hr x 1hr = $19.20

Random Search (total combinations > 20, <= 100)

Sample N random combinations from the full grid. More efficient than grid search for larger spaces.

Random Search: 50 possible combinations, sampling 15 runs
Estimated cost: 15 x $3.20/hr x 1hr = $48.00
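
Random search over a discrete grid can be sketched as sampling without replacement from the enumerated combinations. The helper name `sample_grid` and the fixed default seed are assumptions made for illustration; the seed only makes the selection reproducible.

```python
import itertools
import random


def sample_grid(param_values, n_samples, seed=0):
    """Pick n_samples distinct combinations from the full grid."""
    names = list(param_values)
    grid = [
        dict(zip(names, combo))
        for combo in itertools.product(*(param_values[n] for n in names))
    ]
    rng = random.Random(seed)  # local RNG so global random state is untouched
    return rng.sample(grid, min(n_samples, len(grid)))
```

For a grid this small, enumerating all combinations first is cheap and guarantees that no configuration is launched twice.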

Bayesian Optimization via Optuna (total combinations > 100 or continuous ranges)

Use Optuna's TPE sampler to intelligently explore the space. This requires a coordinator script.

Bayesian Search: Continuous space, 20 trials with Optuna TPE
Estimated cost: 20 x $3.20/hr x 1hr = $64.00 (may reach a good configuration in fewer trials than random search)

Present the recommendation and let the user confirm or adjust.
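
The thresholds and cost estimates above reduce to a few lines of arithmetic. The sketch below is one way to express them; `choose_strategy` and `estimate_cost` are hypothetical helper names, and the hourly rate is whatever price applies to the chosen GPU.

```python
def choose_strategy(total_combinations, has_continuous_range=False):
    """Map the size of the search space to a recommended strategy."""
    if has_continuous_range or total_combinations is None:
        return "bayesian"
    if total_combinations > 100:
        return "bayesian"
    if total_combinations > 20:
        return "random"
    return "grid"


def estimate_cost(num_runs, hourly_rate, hours_per_run=1.0):
    """Estimated sweep cost in dollars: runs x rate x duration."""
    return round(num_runs * hourly_rate * hours_per_run, 2)
```

For example, `estimate_cost(6, 3.20)` reproduces the $19.20 grid search estimate shown earlier.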

Step 3: Locate or Create Base Training YAML

The sweep needs a base SkyPilot YAML to parameterize. Check for existing YAML files:

ls *.yaml *.yml 2>/dev/null

If a base YAML exists, read it and identify where sweep parameters should be injected. Common injection points:

envs section (environment variables the training script reads)
run section (command-line arguments)
Mounted config files

If no base YAML exists, ask the user for their training setup and generate one. Use the same approach as the /sky-launch skill.

Step 4: Generate Sweep Jobs

For Grid/Random Search

Generate a shell script that launches all sweep jobs as SkyPilot managed jobs:

#!/usr/bin/env bash
# Hyperparameter sweep: lr x batch_size
# Generated by /sky-sweep
# Total runs: 6

set -e

SWEEP_ID="sweep-$(date +%Y%m%d-%H%M%S)"

echo "Starting sweep: $SWEEP_ID"
echo "sweep_id,job_name,lr,batch_size" > "${SWEEP_ID}-manifest.csv"

for lr in 1e-4 3e-4 1e-3; do
  for batch in 16 32; do
    JOB_NAME="${SWEEP_ID}-lr${lr}-bs${batch}"
    echo "Launching: $JOB_NAME (lr=$lr, batch=$batch)"

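    # -d (--detach-run) returns as soon as the job is submitted instead of
    # streaming its logs, so the loop moves on and the jobs run in parallel.
    sky jobs launch train.yaml \
      -n "$JOB_NAME" \
      --env LEARNING_RATE="$lr" \
      --env BATCH_SIZE="$batch" \
      --env SWEEP_ID="$SWEEP_ID" \
      --env WANDB_RUN_NAME="$JOB_NAME" \
      -d -y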

    echo "$SWEEP_ID,$JOB_NAME,$lr,$batch" >> "${SWEEP_ID}-manifest.csv"

    # Brief pause to avoid API rate limits
    sleep 2
  done
done

echo ""
echo "Sweep launched: $SWEEP_ID"
echo "Total jobs: 6"
echo "Monitor with: sky jobs queue"
echo "Manifest: ${SWEEP_ID}-manifest.csv"

Write this script to the current directory (for example as sweep-YYYYMMDD-HHMMSS.sh, the name used in Step 6) and make it executable.

Important considerations:

Each job gets a unique name encoding its hyperparameters for easy identification
Pass hyperparameters via --env so the training script reads them from environment variables
Include WANDB_RUN_NAME so each run is identifiable by its job name in W&B, if W&B is configured
Include a SWEEP_ID env var so the training script can group results
Write a manifest CSV for tracking
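
When the sweep parameters differ from the lr/batch example, the job name and `--env` arguments can be derived from a dictionary instead of being hand-written. The sketch below assumes a hypothetical helper `build_launch_command` and a caller-supplied `labels` mapping from env var name to a short tag; it uses only the `sky jobs launch` flags that already appear in the script above.

```python
def build_launch_command(sweep_id, env_values, labels, yaml_path="train.yaml"):
    """Return (job_name, argv) for one sweep run.

    env_values maps env var names to string values. Keeping values as the
    user's original strings (e.g. "1e-4") avoids float reformatting.
    """
    suffix = "-".join(f"{labels[name]}{value}" for name, value in env_values.items())
    job_name = f"{sweep_id}-{suffix}"
    cmd = ["sky", "jobs", "launch", yaml_path, "-n", job_name]
    for name, value in env_values.items():
        cmd += ["--env", f"{name}={value}"]
    # Grouping and naming variables described in the considerations above.
    cmd += ["--env", f"SWEEP_ID={sweep_id}"]
    cmd += ["--env", f"WANDB_RUN_NAME={job_name}"]
    cmd.append("-y")
    return job_name, cmd
```

Returning the argument list rather than a shell string lets the caller pass it to `subprocess.run` without quoting concerns.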

For Bayesian Optimization (Optuna)

Generate a Python coordinator script that:

Creates an Optuna study
Suggests hyperparameters via TPE
Launches each trial as a SkyPilot managed job
Waits for completion and reads the result
Reports back to Optuna for the next suggestion

#!/usr/bin/env python3
"""Optuna-based hyperparameter sweep via SkyPilot managed jobs."""
import re
import subprocess
import time

import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    warmup = trial.suggest_int("warmup_steps", 50, 500)

    job_name = f"optuna-trial-{trial.number}"

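    # Launch SkyPilot managed job. -d (--detach-run) returns after submission
    # instead of streaming logs, so status polling below does the waiting.
    cmd = [
        "sky", "jobs", "launch", "train.yaml",
        "-n", job_name,
        "--env", f"LEARNING_RATE={lr}",
        "--env", f"BATCH_SIZE={batch_size}",
        "--env", f"WARMUP_STEPS={warmup}",
        "-d", "-y"
    ]
    subprocess.run(cmd, check=True)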

    # Wait for job completion and extract metric
    val_loss = wait_and_extract_metric(job_name)
    return val_loss

def wait_and_extract_metric(job_name):
    """Poll job status and extract the final validation metric from logs."""
    while True:
        result = subprocess.run(
            ["sky", "jobs", "queue"],
            capture_output=True, text=True
        )
        # Inspect only this job's row so other jobs' statuses cannot match.
        # Assumes the job name appears untruncated as its own column.
        row = next(
            (line for line in result.stdout.splitlines()
             if job_name in line.split()),
            ""
        )
        if "SUCCEEDED" in row:
            break
        elif "FAILED" in row or "CANCELLED" in row:
            return float("inf")  # Report a failed trial as the worst value
        time.sleep(60)

    # Fetch logs by job name (-n); a positional argument is a job ID
    logs = subprocess.run(
        ["sky", "jobs", "logs", "-n", job_name, "--no-follow"],
        capture_output=True, text=True
    ).stdout

    # Parse the last reported val_loss, allowing scientific notation
    matches = re.findall(
        r"val_loss[=:]\s*(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", logs
    )
    if matches:
        return float(matches[-1])
    return float("inf")

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)

    print("\nBest trial:")
    print(f"  Value: {study.best_trial.value}")
    print(f"  Params: {study.best_trial.params}")

Write this script and tell the user they need to run pip install optuna locally before using it.

Step 5: Ensure Training Script Reads Environment Variables

Verify that the training script reads hyperparameters from environment variables. If the base YAML uses a config file instead of env vars, generate a wrapper script:

#!/usr/bin/env bash
# run_sweep.sh -- wrapper that injects env vars into config
# Override config values with environment variables
sed -i "s/learning_rate:.*/learning_rate: ${LEARNING_RATE}/" config.yaml
sed -i "s/per_device_train_batch_size:.*/per_device_train_batch_size: ${BATCH_SIZE}/" config.yaml

# Run training
python train.py --config config.yaml

Or for Python training scripts, ensure they have fallback logic:

import os
lr = float(os.environ.get("LEARNING_RATE", "3e-4"))
batch_size = int(os.environ.get("BATCH_SIZE", "16"))

Step 6: Launch the Sweep

Ask the user to confirm before launching. Present the total cost estimate:

SWEEP SUMMARY:
  Strategy: Grid search
  Parameters: lr (3 values) x batch_size (2 values)
  Total runs: 6
  GPU per run: A100:1 (spot @ $1.20/hr)
  Est. duration per run: 1 hour
  Est. total cost: $7.20

  All 6 jobs will be launched as SkyPilot managed jobs.
  They will run in parallel across available cloud capacity.

  Proceed?

After confirmation, execute the sweep script:

bash sweep-YYYYMMDD-HHMMSS.sh

Step 7: Monitor Sweep Progress

After launching, show the user how to monitor:

# Check all sweep jobs
sky jobs queue

# Stream logs for a specific run
sky jobs logs -n JOB_NAME

# Watch for completions
watch -n 30 sky jobs queue

If the user asks for a status update, run sky jobs queue and present a summary:

SWEEP PROGRESS: sweep-20260325-143000
  Total: 6 jobs
  Running:   3  (lr=1e-4/bs=16, lr=3e-4/bs=16, lr=1e-3/bs=16)
  Succeeded: 2  (lr=1e-4/bs=32, lr=3e-4/bs=32)
  Pending:   1  (lr=1e-3/bs=32)
  Failed:    0
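
A summary like the one above can be assembled from a mapping of job name to status. The sketch below assumes the statuses have already been read from `sky jobs queue`; the helper name `summarize_progress` is hypothetical, and the bucketing (for instance counting cancelled jobs as failed and recovering jobs as running) is an illustrative choice.

```python
def summarize_progress(sweep_id, job_statuses):
    """Format a progress summary from {job_name: status_string}."""
    buckets = {"Running": [], "Succeeded": [], "Pending": [], "Failed": []}
    for job_name, status in job_statuses.items():
        if status == "SUCCEEDED":
            key = "Succeeded"
        elif status.startswith("FAILED") or status == "CANCELLED":
            key = "Failed"
        elif status in ("RUNNING", "RECOVERING"):
            key = "Running"
        else:
            key = "Pending"  # anything not yet running
        # Drop the shared sweep prefix so only the hyperparameters remain.
        buckets[key].append(job_name.removeprefix(sweep_id + "-"))

    lines = [f"SWEEP PROGRESS: {sweep_id}", f"  Total: {len(job_statuses)} jobs"]
    for key, jobs in buckets.items():
        label = key + ":"
        detail = f"  ({', '.join(jobs)})" if jobs else ""
        lines.append(f"  {label:<10} {len(jobs)}{detail}")
    return "\n".join(lines)
```

`str.removeprefix` requires Python 3.9 or later.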

Step 8: Collect and Rank Results

When all jobs complete, collect results. For each completed job:

sky jobs logs -n JOB_NAME --no-follow

Extract the final validation metric (val_loss, val_bpb, accuracy, etc.) from each job's logs.
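
Metric extraction can be sketched as a regular-expression search that keeps the last reported value. This assumes the training script prints lines such as `val_loss: 1.234` or `val_loss=1.234`; the function name `extract_final_metric` is hypothetical, and the pattern would need adjusting for other log formats.

```python
import re


def extract_final_metric(logs, metric="val_loss"):
    """Return the last value logged for `metric`, or None if absent."""
    pattern = (
        re.escape(metric)
        + r"\s*[=:]\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    )
    matches = re.findall(pattern, logs)
    return float(matches[-1]) if matches else None
```

Returning `None` for a missing metric lets the caller distinguish "no value logged" from a genuinely poor result.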

Present a ranked comparison:

=== SWEEP RESULTS ===
Sweep ID: sweep-20260325-143000
Metric: val_loss (lower is better)

  Rank | Job Name            | lr     | batch | val_loss | Duration | Cost
  -----|---------------------|--------|-------|----------|----------|------
  1    | sweep-lr3e-4-bs16   | 3e-4   | 16    | 1.234    | 58m      | $1.16
  2    | sweep-lr1e-4-bs32   | 1e-4   | 32    | 1.289    | 52m      | $1.04
  3    | sweep-lr1e-4-bs16   | 1e-4   | 16    | 1.312    | 61m      | $1.22
  4    | sweep-lr3e-4-bs32   | 3e-4   | 32    | 1.345    | 49m      | $0.98
  5    | sweep-lr1e-3-bs16   | 1e-3   | 16    | 1.567    | 55m      | $1.10
  6    | sweep-lr1e-3-bs32   | 1e-3   | 32    | 1.892    | 47m      | $0.94

  BEST CONFIG: lr=3e-4, batch_size=16 (val_loss=1.234)

  Total sweep cost: $6.44
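
The ranking and cost columns can be computed as follows. This is a sketch under stated assumptions: each result is a dictionary with hypothetical keys `job`, `metric`, and `minutes`, and cost is duration times a single hourly rate, as in the example table.

```python
def rank_results(results, hourly_rate, lower_is_better=True):
    """Sort runs by metric and attach rank and estimated cost."""
    ranked = sorted(
        results, key=lambda r: r["metric"], reverse=not lower_is_better
    )
    rows = []
    for rank, run in enumerate(ranked, start=1):
        cost = round(run["minutes"] / 60 * hourly_rate, 2)
        rows.append({**run, "rank": rank, "cost": cost})
    total_cost = round(sum(row["cost"] for row in rows), 2)
    return rows, total_cost
```

For metrics where higher is better, such as accuracy, pass `lower_is_better=False`.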

Recommend the best configuration and suggest next steps:

Launch a full training run with the best config: /sky-launch
Run a finer sweep around the best values
Evaluate the best checkpoint: /sky-eval

Reference

For YAML spec and managed job details, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.