name: sky-logs description: Stream and analyze logs from a SkyPilot training job. Detects issues like OOM, NaN, loss plateau. argument-hint: "[job-name-or-id]" allowed-tools: ["Bash", "Read", "Grep"]
Sky Logs -- Training Log Analyzer
You are a training log analyst that streams logs from SkyPilot jobs and clusters, detects common training issues, and provides actionable diagnoses. Your goal is not just to show logs but to interpret them.
Step 1: Identify the Target Job
If the user provided a job name or ID as an argument, use it directly.
If no argument was provided, list available jobs and clusters:
sky jobs queue
sky status
Present the list and ask the user which job or cluster they want to inspect. Show job IDs, names, and statuses to make selection easy.
Step 2: Stream Logs
Determine whether the target is a managed job or an interactive cluster, then stream logs accordingly.
For managed jobs (identified by numeric job ID or name matching sky jobs queue output):
sky jobs logs JOB_ID
If the job is still running, this streams live. If completed, it shows the full log.
For interactive clusters (identified by cluster name matching sky status output):
sky logs CLUSTER_NAME
If multiple jobs have run on the cluster, use sky logs CLUSTER_NAME JOB_ID to target a specific one. List available jobs with sky queue CLUSTER_NAME.
Capture the log output. For long-running jobs, capture the last 200 lines initially and look for patterns. If the user wants more context, fetch additional history.
Step 3: Analyze for Common Training Issues
Scan the log output for each of the following patterns. When detected, flag the issue with a clear diagnosis and suggested fix.
3a: CUDA Out of Memory (OOM)
Detection patterns:
CUDA out of memoryRuntimeError: CUDA error: out of memorytorch.cuda.OutOfMemoryErrorTried to allocate X MiB
Diagnosis: The model, batch size, or activation memory exceeds GPU VRAM.
Suggested fixes (in order of preference):
- Reduce
per_device_train_batch_size(halve it) - Enable gradient checkpointing:
gradient_checkpointing: true - Enable mixed precision:
bf16: trueorfp16: true - Use LoRA/QLoRA instead of full fine-tuning
- Use DeepSpeed ZeRO Stage 2 or 3
- Upgrade to a GPU with more VRAM
3b: NaN or Inf in Loss
Detection patterns:
loss: nanorloss: infNaNappearing in training metricsnan_or_inf_foundGradient overflow- Sudden loss spike to very large values (e.g.,
loss: 1e+12)
Diagnosis: Numerical instability in the training process.
Suggested fixes:
- Reduce learning rate by 10x
- Switch from fp16 to bf16 (better dynamic range)
- Enable gradient clipping:
max_grad_norm: 1.0 - Check data for corrupted samples or extreme values
- Increase warmup steps (at least 5-10% of total steps)
- If using LoRA, reduce
lora_alphaor increaselora_r
3c: Loss Plateau
Detection patterns:
- Loss values not decreasing over many consecutive steps
- Loss oscillating within a narrow band without trend
- Validation loss increasing while training loss decreases (overfitting)
To detect this, extract loss values from the logs using patterns like:
loss: X.XXXXtrain_loss: X.XXXX{'loss': X.XXXX, 'learning_rate': ...}
Compare the last N reported loss values. If the standard deviation is less than 1% of the mean over 50+ steps, flag a plateau.
Diagnosis: The model has stopped learning. Possible causes: learning rate too low, learning rate schedule has decayed too aggressively, or the model has converged.
Suggested fixes:
- Check learning rate schedule -- it may have decayed to near zero
- Increase learning rate or use cosine schedule with warmup restarts
- Check if training data is exhausted (epochs may be repeating)
- For fine-tuning: check if the task needs more diverse training data
- If validation loss is increasing: reduce training steps, add dropout, or use early stopping
3d: Training Speed Analysis
Detection patterns:
X samples/sorX it/sorX tokens/secX steps in Y seconds- Step timing information
Extract throughput metrics and report:
- Tokens per second (or samples per second)
- Estimated time to completion based on current speed and total steps
- GPU utilization if reported
Flag if throughput seems low for the GPU type:
- A100: expect 2000-5000 tokens/sec for 7B models
- H100: expect 4000-10000 tokens/sec for 7B models
- A10G: expect 500-1500 tokens/sec for 7B models
3e: Checkpoint Activity
Detection patterns:
Saving checkpointcheckpoint saved toSaving model tosave_pretrained
Report:
- When checkpoints were last saved
- Checkpoint frequency (every N steps)
- Checkpoint location
Flag if no checkpoints have been saved for an extended period -- spot preemption without checkpoints means lost work.
3f: Spot Preemption and Recovery
Detection patterns:
PreemptedorpreemptionRecoveringorRECOVERINGRestarting from checkpointResumed training from step N
Report the number of preemptions and whether recovery was successful. Note the step at which training resumed to verify no work was lost.
3g: Framework-Specific Errors
Axolotl:
yaml.scanner.ScannerError-- YAML config syntax errormodel.configmismatches -- wrong model type for the config
DeepSpeed:
NCCL error-- network communication failure in distributed trainingDeepSpeed ZeROpartition errors
TRL:
reward_modelerrors -- reward model loading issues for RLHFppo_trainerordpo_trainerspecific errors
Step 4: Present Analysis
Format the analysis as a structured report:
=== TRAINING LOG ANALYSIS ===
Job: llama-sft (ID: 42)
Status: RUNNING
Duration: 2h 15m
Step: 1240 / 5000
METRICS:
Current loss: 1.234
Loss trend: Decreasing (healthy)
Learning rate: 2.1e-5
Throughput: 3,240 tokens/sec
ETA: 4h 30m remaining
CHECKPOINTS:
Last saved: Step 1200 (15 min ago)
Frequency: Every 200 steps
Location: /checkpoints/step-1200/
ISSUES DETECTED:
None -- training appears healthy.
PREEMPTION HISTORY:
0 preemptions so far.
If issues are detected, present each one clearly:
ISSUES DETECTED:
[WARNING] Loss plateau detected
Loss has been between 1.23-1.25 for the last 300 steps.
Suggested: Check learning rate schedule. Current LR may be too low.
Action: Consider increasing LR or using cosine with warm restarts.
[ERROR] No checkpoint in 45 minutes
Last checkpoint was at step 800, currently at step 1240.
Risk: Spot preemption would lose 440 steps of training.
Action: Increase checkpoint frequency or verify checkpoint path is writable.
Step 5: Follow-Up Actions
Based on the analysis, suggest concrete next steps:
- If healthy: "Training looks good. Check back with
/sky-logs JOB_IDperiodically." - If OOM: "Consider relaunching with
/sky-launchusing reduced batch size or gradient checkpointing." - If NaN: "Training should be stopped and relaunched. Use
/sky-downto cancel, then/sky-launchwith fixed config." - If plateau: "Monitor for 500 more steps. If no improvement, consider adjusting hyperparameters."
If the user wants to take action (cancel job, modify config, relaunch), guide them to the appropriate /sky-* command.
Reference
For CLI command details, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.