name: monitor-experiment
description: Poll a running W&B training run for progress and emit structured alerts
Monitor Experiment
Purpose
Check a running experiment's W&B metrics, continuously or on demand, and emit alerts for anomalies. Supports the "30-minute quality check" paradigm: after the first 30 minutes of a long training run, produce a checkpoint quality report before committing more resources.
Prerequisites
- WANDB_API_KEY is set in the environment.
- The experiment is actively logging to W&B (not in WANDB_MODE=offline).
- For offline mode: read from the local wandb-summary.json instead.
Inputs
| Parameter | Required | Description |
|---|---|---|
| run_id | Yes* | W&B run ID (e.g., entity/project/run_id) |
| output_dir | Yes* | Local output directory (for offline mode fallback) |
| poll_interval | No | Seconds between polls (default: 60) |
| alert_on | No | List of alert conditions to enable (default: all) |
* One of run_id or output_dir is required.
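The parameter rules above can be sketched as a small config object. This is an illustration only, not part of the skill: the class name MonitorConfig is hypothetical, and of the alert names in KNOWN_ALERTS only the first three appear in this document's example usage; no_progress and loss_plateau are assumed names for the remaining conditions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The first three names appear in the example usage; the last two are
# assumed names for the remaining alert conditions.
KNOWN_ALERTS = (
    "loss_spike",
    "nan_gradient",
    "step_time_regression",
    "no_progress",
    "loss_plateau",
)


@dataclass
class MonitorConfig:
    """Hypothetical container for the skill's input parameters."""

    run_id: Optional[str] = None      # e.g. "entity/project/run_id"
    output_dir: Optional[str] = None  # offline-mode fallback
    poll_interval: int = 60           # seconds between polls
    alert_on: List[str] = field(default_factory=lambda: list(KNOWN_ALERTS))

    def __post_init__(self) -> None:
        # One of run_id or output_dir is required.
        if not self.run_id and not self.output_dir:
            raise ValueError("one of run_id or output_dir is required")
        if self.poll_interval <= 0:
            raise ValueError("poll_interval must be positive")
        unknown = sorted(set(self.alert_on) - set(KNOWN_ALERTS))
        if unknown:
            raise ValueError(f"unknown alert conditions: {unknown}")
```

Validating at construction time means a misspelled alert name fails immediately rather than being silently ignored during a long run.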
Steps
1. Connect to the run
Online mode (preferred):
import wandb
api = wandb.Api()
run = api.run("<run_id>")
Offline fallback:
import json
summary_path = f"{output_dir}/tracker/wandb/latest-run/files/wandb-summary.json"
with open(summary_path) as f:
    summary = json.load(f)
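The two connection modes can be combined into one helper that prefers the online API and falls back to the local summary file. This is a minimal sketch: the function name load_summary is hypothetical, and it assumes run.summary supports keys() and item access, which is how the W&B public API summary object is commonly used.

```python
import json
import os
from typing import Any, Dict, Optional


def load_summary(run_id: Optional[str] = None,
                 output_dir: Optional[str] = None) -> Dict[str, Any]:
    """Return the latest summary metrics, preferring the online API."""
    if run_id:
        import wandb  # imported lazily so the offline path works without it

        run = wandb.Api().run(run_id)
        # Copy the dict-like summary into a plain dict (assumed interface).
        return {key: run.summary[key] for key in run.summary.keys()}
    if output_dir:
        summary_path = os.path.join(
            output_dir, "tracker", "wandb", "latest-run", "files",
            "wandb-summary.json")
        with open(summary_path) as f:
            return json.load(f)
    raise ValueError("one of run_id or output_dir is required")
```

The summary holds only the latest value per key, so checks that need a history (such as a rolling average) would have to accumulate values across polls or read the run history separately.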
2. Track key metrics
| Metric | W&B Key | Description |
|---|---|---|
| Training loss | train_loss | Primary training loss |
| Gradient norm | grad_norm | Gradient magnitude |
| Step time | step_time | Wall-clock seconds per step |
| Learning rate | learning_rate | Current LR |
| Avg step time | avg_step_time | Running average step time |
| Validation videos | validation_videos_* | Generated validation samples |
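Pulling the tracked keys out of a summary dictionary might look like the sketch below. The function name extract_metrics and the shape of its return value are assumptions; the key names come from the table above, and validation_videos_* is treated as a prefix match.

```python
from typing import Any, Dict

SCALAR_KEYS = ("train_loss", "grad_norm", "step_time",
               "learning_rate", "avg_step_time")
VIDEO_PREFIX = "validation_videos_"


def extract_metrics(summary: Dict[str, Any]) -> Dict[str, Any]:
    """Split a summary dict into tracked scalars and validation video keys."""
    # Keep only the scalar keys that the run has actually logged so far.
    scalars = {key: summary[key] for key in SCALAR_KEYS if key in summary}
    # Any key starting with the prefix counts as a validation video entry.
    video_keys = sorted(k for k in summary if k.startswith(VIDEO_PREFIX))
    return {"scalars": scalars, "validation_video_keys": video_keys}
```

Missing keys are skipped rather than raised on, since early in a run some metrics (for example validation videos) may not have been logged yet.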
3. Evaluate alert conditions
| Alert | Condition | Severity |
|---|---|---|
| Loss spike | current_loss > 3 × rolling_avg_loss | 🔴 Critical |
| NaN/Inf gradient | grad_norm is NaN or Inf | 🔴 Critical |
| Step time regression | step_time > 2 × baseline_step_time | 🟡 Warning |
| No progress | No new W&B logs for > 10 minutes | 🟡 Warning |
| Loss plateau | Loss change < 1% over last 100 steps | 🟢 Info |
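The conditions in the table can be sketched as a single pure function. Several details here are assumptions rather than part of the skill: the function name and parameters, the rolling average being taken over up to 100 loss values preceding the current one, one loss value per step, "loss change" meaning the relative change between the current loss and the loss 100 steps earlier, and the alert names no_progress and loss_plateau.

```python
import math
import time
from typing import Dict, List, Optional, Sequence


def evaluate_alerts(losses: Sequence[float],
                    grad_norm: float,
                    step_time: float,
                    baseline_step_time: float,
                    last_log_time: float,
                    now: Optional[float] = None,
                    window: int = 100) -> List[Dict[str, str]]:
    """Return the alerts triggered by the latest metrics (oldest loss first)."""
    now = time.time() if now is None else now
    alerts: List[Dict[str, str]] = []

    def add(kind: str, severity: str, message: str) -> None:
        alerts.append({"type": kind, "severity": severity, "message": message})

    # Loss spike: current loss vs. the average of the preceding window.
    if len(losses) >= 2:
        current = losses[-1]
        previous = losses[-(window + 1):-1]
        rolling_avg = sum(previous) / len(previous)
        if current > 3 * rolling_avg:
            add("loss_spike", "critical",
                f"Loss jumped to {current:.2f} (avg: {rolling_avg:.2f})")

    # NaN/Inf gradient.
    if not math.isfinite(grad_norm):
        add("nan_gradient", "critical", f"grad_norm is {grad_norm}")

    # Step time regression against a baseline supplied by the caller.
    if step_time > 2 * baseline_step_time:
        add("step_time_regression", "warning",
            f"Step time {step_time:.2f}s (baseline: {baseline_step_time:.2f}s)")

    # No progress: more than 10 minutes since the last logged update.
    if now - last_log_time > 10 * 60:
        add("no_progress", "warning", "No new W&B logs for > 10 minutes")

    # Loss plateau: under 1% relative change over the last `window` steps.
    if len(losses) > window and losses[-(window + 1)] != 0:
        old = losses[-(window + 1)]
        if abs(losses[-1] - old) / abs(old) < 0.01:
            add("loss_plateau", "info",
                f"Loss changed < 1% over last {window} steps")

    return alerts
```

Passing now explicitly keeps the function deterministic, which makes the thresholds easy to unit-test without waiting on a real clock.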
4. Emit structured status
Output format (agent-consumable):
{
  "run_id": "...",
  "step": 500,
  "metrics": {
    "train_loss": 0.078,
    "grad_norm": 0.41,
    "step_time": 2.5,
    "learning_rate": 1e-6
  },
  "alerts": [
    {"type": "loss_spike", "severity": "critical", "message": "Loss jumped to 0.45 (avg: 0.08)"}
  ],
  "status": "running"
}
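A status payload in this shape could be serialized as in the sketch below. The function name build_status is hypothetical. One design choice is an assumption on top of the format above: strict JSON has no NaN or Infinity literal, so non-finite metric values are emitted as null to keep the output parseable by any consumer.

```python
import json
import math
from typing import Any, Dict, List


def _json_safe(value: Any) -> Any:
    """Replace non-finite floats with None so the output is strict JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value


def build_status(run_id: str,
                 step: int,
                 metrics: Dict[str, Any],
                 alerts: List[Dict[str, str]],
                 status: str = "running") -> str:
    """Serialize one status update in the format shown above."""
    payload = {
        "run_id": run_id,
        "step": step,
        "metrics": {key: _json_safe(val) for key, val in metrics.items()},
        "alerts": alerts,
        "status": status,
    }
    # allow_nan=False makes json.dumps raise instead of emitting NaN/Infinity.
    return json.dumps(payload, indent=2, allow_nan=False)
```

A consumer that sees a null metric should rely on the accompanying alert entry for the reason, since the original non-finite value is not preserved.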
5. 30-Minute Quality Check
After the first 30 minutes of wall-clock time:
- Summarize the loss curve shape (decreasing? at what rate?).
- Check if validation videos have been generated.
- Report step count, loss at start vs. current, and estimated time to completion.
- Produce a go/no-go recommendation.
## 30-Minute Check: <run_name>
- **Steps completed**: 150
- **Loss**: 0.12 → 0.08 (↓ 33%)
- **Grad norm**: stable at ~0.4
- **Step time**: 2.5s/step (consistent)
- **Validation videos**: 5 generated at step 100
- **Recommendation**: ✅ Continue — loss is decreasing normally
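A report like the one above could be rendered from a handful of numbers, as sketched here. The function name, the total_steps parameter (needed for the time estimate), and the go/no-go rule (continue only if the current loss is finite and below the starting loss) are assumptions for illustration; a real check would likely also weigh gradient norm and validation output.

```python
import math


def quality_report(run_name: str,
                   steps_completed: int,
                   total_steps: int,
                   start_loss: float,
                   current_loss: float,
                   step_time: float,
                   videos_generated: int) -> str:
    """Render a 30-minute check report as Markdown text."""
    # Relative loss change since the start of the run, in percent.
    change_pct = ((start_loss - current_loss) / start_loss * 100
                  if start_loss else 0.0)
    arrow = "↓" if change_pct >= 0 else "↑"
    # Remaining time, assuming the current step time holds.
    eta_hours = max(total_steps - steps_completed, 0) * step_time / 3600
    # Assumed rule: continue only if the loss is finite and has decreased.
    healthy = math.isfinite(current_loss) and current_loss < start_loss
    recommendation = ("✅ Continue: loss is decreasing" if healthy
                      else "❌ Stop: loss is not decreasing")
    return "\n".join([
        f"## 30-Minute Check: {run_name}",
        f"- **Steps completed**: {steps_completed}",
        f"- **Loss**: {start_loss:.2f} → {current_loss:.2f} "
        f"({arrow} {abs(change_pct):.0f}%)",
        f"- **Step time**: {step_time:.1f}s/step",
        f"- **Validation videos**: {videos_generated} generated",
        f"- **Estimated time remaining**: {eta_hours:.1f} h",
        f"- **Recommendation**: {recommendation}",
    ])
```

For example, quality_report("abc123", 150, 1000, 0.12, 0.08, 2.5, 5) reports a 33% loss decrease and recommends continuing.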
Outputs
- Structured JSON status updates.
- Alert messages for anomalous conditions.
- 30-minute checkpoint quality report.
Example Usage
Monitor W&B run "fastvideo/Wan_distillation/abc123":
run_id: fastvideo/Wan_distillation/abc123
poll_interval: 120
alert_on: [loss_spike, nan_gradient, step_time_regression]
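With parameters like these, the polling itself could be driven by a loop such as the sketch below. The function name poll and its callback parameters are hypothetical. It also assumes the status dictionary carries a "status" field whose value changes from "running" once the run ends; the document only shows the "running" value.

```python
import time
from typing import Any, Callable, Dict, Optional


def poll(fetch_status: Callable[[], Dict[str, Any]],
         on_status: Callable[[Dict[str, Any]], None],
         poll_interval: float = 60,
         max_polls: Optional[int] = None,
         sleep: Callable[[float], None] = time.sleep) -> int:
    """Fetch and report status every poll_interval seconds.

    Returns the number of polls performed.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        status = fetch_status()
        on_status(status)
        polls += 1
        # Stop once the run is no longer reported as running.
        if status.get("status") != "running":
            break
        sleep(poll_interval)
    return polls
```

Injecting sleep as a parameter lets an on-demand check run a single poll (max_polls=1) and lets tests run the loop without waiting.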
References
- fastvideo/training/trackers.py: WandbTracker implementation
- fastvideo/tests/training/Vanilla/test_training_loss.py: how summaries are compared
- fastvideo/tests/training/Vanilla/a40_reference_wandb_summary.json: reference summary format
Changelog
| Date | Change |
|---|---|
| 2026-03-02 | Initial version |