---
name: sky-launch
description: Launch an ML training job on cloud GPUs via SkyPilot. Generates YAML, validates config, estimates cost, and launches.
argument-hint: "[framework] [model] [gpu] -- e.g., 'axolotl llama3 H100:8'"
allowed-tools: ["Read", "Write", "Edit", "Bash", "Glob", "Grep", "Agent"]
---
Sky Launch -- ML Training Job Launcher
You are an interactive assistant that helps the user launch ML training jobs on cloud GPUs through SkyPilot. Walk the user through configuration, generate a validated YAML task spec, estimate costs, and execute the launch. Be methodical -- a misconfigured launch wastes time and money.
Step 1: Determine the Training Framework
If the user provided a framework in their argument, use it. Otherwise, ask which framework they want. Map common frameworks to their setup requirements:
| Framework | Install | Launch Command | Best For |
|---|---|---|---|
| axolotl | `pip install axolotl[flash-attn]` | `accelerate launch -m axolotl.cli.train config.yml` | SFT, LoRA, QLoRA fine-tuning |
| torchtune | `pip install torchtune` | `tune run full_finetune_distributed --config config.yaml` | Meta-native fine-tuning |
| NeMo | `pip install nemo_toolkit[all]` | `python train.py trainer.devices=$SKYPILOT_NUM_GPUS_PER_NODE` | Large-scale pretraining |
| TRL | `pip install trl` | `python train.py` or `trl sft --config config.yaml` | RLHF, DPO, GRPO, KTO |
| custom | User-defined | User-defined | Custom training scripts |
If the user says something like "fine-tune" or "SFT", suggest axolotl. If they mention "DPO" or "RLHF", suggest TRL. If they mention "pretraining" at scale, suggest NeMo.
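The keyword rules above can be sketched as a small lookup. This is only an illustration: the function name `suggest_framework`, the extra spelling `finetune`, and the case-insensitive substring matching are assumptions, not part of any SkyPilot or framework API.

```python
from typing import Optional

# Keyword hints taken from the guidance above. Order matters: preference
# methods (DPO/RLHF) are checked before the generic "fine-tune" hint.
FRAMEWORK_HINTS = [
    (("dpo", "rlhf"), "TRL"),
    (("pretrain",), "NeMo"),
    (("fine-tune", "finetune", "sft"), "axolotl"),  # "finetune" is an assumed variant
]


def suggest_framework(request: str) -> Optional[str]:
    """Return a suggested framework name, or None if no hint matches."""
    text = request.lower()
    for keywords, framework in FRAMEWORK_HINTS:
        if any(keyword in text for keyword in keywords):
            return framework
    return None  # no match: ask the user instead of guessing
```

When nothing matches, the function returns `None` so the assistant falls back to asking the user directly.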
Step 2: Determine the Model
Identify the base model. Check for:
- HuggingFace model ID (e.g., `meta-llama/Llama-3.1-8B`)
- Local path or S3/GCS checkpoint
- Architecture for pretraining from scratch
Based on the model, infer GPU memory requirements:
- 7-8B models: A100:1 (full) or A10G:1 (LoRA/QLoRA)
- 13B models: A100:1 (LoRA) or A100:2 (full)
- 70B models: A100:4 minimum (LoRA) or A100:8 / H100:8 (full)
- 405B models: H100:8 x 4 nodes minimum
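As a rough sketch, the sizing list above can be encoded as a lookup on parameter count. The function name `recommend_accelerators` is hypothetical, and the handling of sizes between the listed tiers (rounding up to the next tier) is an assumption. Treat the result as a starting point to validate, not a guarantee that the model fits.

```python
def recommend_accelerators(params_b: float, method: str) -> dict:
    """Map model size (billions of parameters) and method to a starting GPU spec.

    method is "full", "lora", or "qlora". Sizes between the listed tiers are
    rounded up to the next tier, which is an assumption of this sketch.
    """
    lora = method.lower() in ("lora", "qlora")
    if params_b <= 8:
        return {"accelerators": "A10G:1" if lora else "A100:1", "num_nodes": 1}
    if params_b <= 13:
        return {"accelerators": "A100:1" if lora else "A100:2", "num_nodes": 1}
    if params_b <= 70:
        return {"accelerators": "A100:4" if lora else "A100:8", "num_nodes": 1}
    # Anything larger is treated like the 405B tier: H100:8 on 4 nodes minimum.
    return {"accelerators": "H100:8", "num_nodes": 4}
```

For example, `recommend_accelerators(70, "lora")` yields `A100:4` on one node, matching the 70B LoRA minimum listed above.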
If the user did not specify a model, ask. If they gave a vague description ("a coding model"), suggest appropriate options.
Step 3: Determine GPU Requirements
If the user specified GPUs in their argument, validate the choice against the model size. Otherwise, recommend based on Step 2 analysis.
Run sky gpus list to check current pricing and availability:
sky gpus list GPU_TYPE:COUNT
Present spot vs on-demand pricing. Recommend spot instances for training jobs with checkpointing, on-demand for short jobs or debugging.
For multi-node training, verify the framework supports distributed training and set num_nodes accordingly.
Step 4: Generate SkyPilot YAML Task Spec
Generate a complete YAML file. Use the following template structure, customizing for the specific framework:
name: {descriptive-job-name}
resources:
  accelerators: {GPU_TYPE}:{COUNT}
  use_spot: true
  disk_size: 512
  disk_tier: high
  job_recovery:
    strategy: FAILOVER
    max_restarts_on_errors: 3
num_nodes: {1 or more for distributed}
envs:
  WANDB_API_KEY: null
  HF_TOKEN: null
  {FRAMEWORK_SPECIFIC_ENVS}
file_mounts:
  /data:
    source: {data_source}
    mode: MOUNT_CACHED
  /checkpoints:
    source: {checkpoint_bucket}
    mode: MOUNT_CACHED
setup: |
  {framework_install_commands}
  {dependency_install_commands}
run: |
  {framework_launch_command}
Key decisions to make for the YAML:
- file_mounts: Use `MOUNT_CACHED` for checkpoints (read-write with local cache). Use `MOUNT` for read-only datasets. Use `COPY` for small config files or code.
- setup vs run: Put all installation in `setup`, which runs once per provisioned cluster (including each replacement cluster brought up by spot recovery). Put the training command in `run`, which is re-executed after recovery.
- Environment variables: Set `WANDB_API_KEY: null` and `HF_TOKEN: null` to inherit from the user's local environment. Add framework-specific variables as needed.
- Distributed training: For multi-GPU single-node, use `torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE`. For multi-node, add `--nnodes=$SKYPILOT_NUM_NODES --node_rank=$SKYPILOT_NODE_RANK --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) --master_port=12345`.
- Spot instance recovery: Always include `job_recovery` for spot instances. Ensure the training script loads from the latest checkpoint on restart.
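To make the distributed-training flags concrete, here is a minimal sketch that assembles the `torchrun` command line from the SkyPilot environment variables named above. The helper name `build_torchrun_command` and the `train.py` script name are hypothetical. The port 12345 is the one used in this document.

```python
import os
import shlex


def build_torchrun_command(script: str, env=None) -> str:
    """Assemble a torchrun command from SkyPilot-provided environment variables."""
    env = os.environ if env is None else env
    gpus = env["SKYPILOT_NUM_GPUS_PER_NODE"]
    num_nodes = int(env.get("SKYPILOT_NUM_NODES", "1"))

    args = ["torchrun", f"--nproc_per_node={gpus}"]
    if num_nodes > 1:
        # SKYPILOT_NODE_IPS holds one IP per line; the first line is the head node.
        head_ip = env["SKYPILOT_NODE_IPS"].splitlines()[0]
        args += [
            f"--nnodes={num_nodes}",
            f"--node_rank={env['SKYPILOT_NODE_RANK']}",
            f"--master_addr={head_ip}",
            "--master_port=12345",
        ]
    args.append(script)
    return " ".join(shlex.quote(arg) for arg in args)
```

Passing an explicit `env` dict makes the helper easy to check locally. Inside a task's `run` section, the real values come from `os.environ`.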
Write the generated YAML to a file in the current working directory.
Step 5: Validate the Configuration
Before launching, check for these known gotchas:
- Ray port conflict: If using Ray or DeepSpeed with Ray, ensure port 6380 is not used (SkyPilot reserves it). Use 6379 instead.
- file_mounts ordering: Mounts happen before `setup`. If setup creates directories that are mount targets, the mount will shadow them. Mount sources must exist.
- MOUNT mode and checkpoint writes: `MOUNT` does not support random writes or appends. If the training script writes checkpoints, the mount must be `MOUNT_CACHED`, not `MOUNT`.
- Large workdir: If the current directory has large files (datasets, checkpoints), SkyPilot will rsync them all. Use `.skyignore` or `file_mounts` instead.
- Exposed ports are public: If the YAML exposes ports (for TensorBoard, W&B), remind the user there is no authentication by default.
- disk_size: For large models (70B+), ensure disk_size is at least 1000 GB for model weights + checkpoints.
- Framework version pins: Pin specific versions in `setup` to avoid breaking changes across spot recoveries.
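Some of the gotchas above can be checked mechanically once the YAML is parsed into a dictionary (for example with a YAML parser). The sketch below is illustrative only: `find_config_issues` is a hypothetical helper, the heuristics (matching "checkpoint" in the mount path, searching `run` for the port number) are assumptions, and an unset `disk_size` is treated as 256 GB, which is also an assumption.

```python
def find_config_issues(task: dict, model_params_b: float = 0) -> list:
    """Return human-readable warnings for a parsed SkyPilot task dict."""
    issues = []
    resources = task.get("resources", {})

    if resources.get("use_spot") and "job_recovery" not in resources:
        issues.append("use_spot is set but job_recovery is missing")

    # Assumed default of 256 GB when disk_size is not set.
    if model_params_b >= 70 and resources.get("disk_size", 256) < 1000:
        issues.append("disk_size should be at least 1000 GB for 70B+ models")

    for path, spec in task.get("file_mounts", {}).items():
        # Heuristic: any mount path containing "checkpoint" is written to.
        if "checkpoint" in path and isinstance(spec, dict):
            if spec.get("mode") != "MOUNT_CACHED":
                issues.append(f"{path}: use MOUNT_CACHED for checkpoint writes")

    if "6380" in task.get("run", ""):
        issues.append("run references port 6380, which SkyPilot reserves")

    if resources.get("ports"):
        issues.append("exposed ports are public and unauthenticated by default")

    return issues
```

An empty list means none of these particular checks fired. It does not replace `sky check` or a dry run.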
Run a quick check:
# Verify cloud credentials are configured
sky check
If any issues are detected, report them and suggest fixes before proceeding.
Step 6: Cost Estimate
Run a dry run to estimate costs:
sky launch --dryrun YAML_FILE -y
Parse the output and present:
- Estimated hourly cost (spot and on-demand)
- Estimated total cost for the expected training duration (ask user for estimate)
- Cheapest cloud/region option
- Comparison: "Spot saves $X/hr vs on-demand"
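The arithmetic behind the summary above is simple enough to sketch. The hourly prices are inputs here: they would come from the dry-run or pricing output, and the numbers in the usage example are made up for illustration. The function names are hypothetical.

```python
def summarize_cost(spot_hourly: float, on_demand_hourly: float, hours: float) -> dict:
    """Compute total cost for both pricing modes and the hourly spot savings."""
    return {
        "spot_total": round(spot_hourly * hours, 2),
        "on_demand_total": round(on_demand_hourly * hours, 2),
        "hourly_savings": round(on_demand_hourly - spot_hourly, 2),
    }


def format_savings(summary: dict) -> str:
    """Render the comparison line presented to the user."""
    return f"Spot saves ${summary['hourly_savings']:.2f}/hr vs on-demand"
```

With illustrative prices of $10/hr spot and $25/hr on-demand over a 12-hour run, `summarize_cost(10.0, 25.0, 12)` gives totals of 120.0 and 300.0, and `format_savings` renders "Spot saves $15.00/hr vs on-demand".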
If the cost seems high, suggest optimizations:
- Smaller GPU type if memory allows
- Spot instances if not already using them
- Cheaper cloud provider
- Gradient checkpointing to fit on fewer GPUs
Step 7: Launch
Ask the user to confirm the launch. Present two options:
Option A -- Managed Job (recommended for training):
sky jobs launch YAML_FILE -n JOB_NAME -y
Benefits: Auto-recovery from spot preemption, auto-cleanup on completion, no idle cost.
Option B -- Interactive Cluster:
sky launch YAML_FILE -c CLUSTER_NAME --idle-minutes-to-autostop 30 -y
Benefits: SSH access for debugging, can run multiple tasks, good for development.
Recommend managed jobs for production training runs, interactive clusters for development and debugging.
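If the launch is driven from a script, the two options can be expressed as argument lists built from the commands shown above. The helper name `build_launch_argv` and its parameters are hypothetical. The flags are the ones used in this document.

```python
def build_launch_argv(yaml_file: str, name: str, managed: bool = True,
                      idle_minutes: int = 30) -> list:
    """Build the argv for Option A (managed job) or Option B (interactive cluster)."""
    if managed:
        # Option A: managed job with auto-recovery and auto-cleanup.
        return ["sky", "jobs", "launch", yaml_file, "-n", name, "-y"]
    # Option B: interactive cluster that autostops after idle_minutes.
    return [
        "sky", "launch", yaml_file, "-c", name,
        "--idle-minutes-to-autostop", str(idle_minutes), "-y",
    ]
```

The resulting list can be handed to `subprocess.run(argv, check=True)`, but only after the user has confirmed the launch.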
Step 8: Report
After launch, report:
- Job ID or cluster name
- Monitoring commands: `sky jobs logs JOB_ID` or `sky logs CLUSTER`
- Status command: `sky jobs queue` or `sky status`
- Cost tracking: `sky cost-report`
- TensorBoard/W&B URL if configured
If the launch fails, diagnose the error:
- Capacity errors: suggest different region/cloud or spot fallback
- Credential errors: run `sky check` and guide user
- YAML syntax errors: fix and retry
- Setup failures: check package versions, network issues
Reference
Refer to the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md for full CLI reference, YAML spec details, and environment variable documentation.