name: monitor-run
description: Monitor an ongoing prime-rl training run. Find the output directory, tail logs, check key metrics, inspect SLURM jobs, and restart safely. Use when asked to check on a run, debug training, or investigate performance.
Monitor a run
Runbook
On launch
- Find the output dir and read the resolved configs at {output_dir}/configs/ (start with rl.toml).
- Confirm all processes are alive and the run is making progress.
- Write the initial summary into {output_dir}/STATUS.md.
Recurring check-ins
Default cadence: 1 hour (researcher can override). At each check-in:
- Confirm processes are alive.
- Grep logs for errors/warnings; note current step and key metrics.
- Append an entry to {output_dir}/STATUS.md (never overwrite):
## YYYY-MM-DD HH:MM UTC
**Step**: {current_step} / {max_steps}
**Health**: {Healthy | Degraded | Down}
**Progress**: reward/mean, seq_len, truncation, eval scores, env-specific metrics.
**Stability**: entropy, mismatch_kl, grad_norm — flag spikes.
**Performance**: trainer vs orchestrator step time, env lag, inference pressure.
**Notes**: anything unusual (errors, restarts, hangs). Omit if nothing notable.
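The append-only rule can be sketched as a small shell helper. This is a minimal sketch, not part of prime-rl: the function name `append_status` and its argument order are hypothetical, and the field values are whatever you gathered during the check-in.

```shell
# Append one check-in entry to STATUS.md without touching earlier entries.
# Usage: append_status OUT_DIR STEP MAX_STEPS HEALTH PROGRESS STABILITY PERFORMANCE [NOTES]
# (append_status and its argument order are hypothetical, not a prime-rl API.)
append_status() {
  out_dir=$1; step=$2; max_steps=$3; health=$4
  progress=$5; stability=$6; performance=$7; notes=${8:-}
  {
    printf '\n## %s\n' "$(date -u '+%Y-%m-%d %H:%M UTC')"
    printf '**Step**: %s / %s\n' "$step" "$max_steps"
    printf '**Health**: %s\n' "$health"
    printf '**Progress**: %s\n' "$progress"
    printf '**Stability**: %s\n' "$stability"
    printf '**Performance**: %s\n' "$performance"
    # Notes are optional: omit the line when there is nothing notable.
    if [ -n "$notes" ]; then printf '**Notes**: %s\n' "$notes"; fi
  } >> "$out_dir/STATUS.md"   # ">>" appends; ">" would overwrite the history
}
```

For example, `append_status "$OUTPUT_DIR" 120 1000 Healthy "reward/mean 0.41" "grad_norm flat" "trainer waits on orchestrator"` adds one timestamped entry and leaves the earlier ones in place.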
Restarting a run
Never restart unless the researcher explicitly asked. Confirm with them the exact restart command and the conditions that warrant a restart.
Never run kill or launch commands from your own shell. Dispatch them to the tmux Launcher window so the researcher sees what was executed:
SESSION=$(tmux display-message -p '#S')
tmux send-keys -t "$SESSION:Launcher" 'your command here' Enter
After a restart, verify all processes are back up and progress resumed before the next check-in.
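The dispatch pattern above can be wrapped so that a missing Launcher window is caught before any keys are sent. This is a sketch under assumptions: `dispatch_to_launcher` is a hypothetical helper name, and it assumes it is called from inside the run's tmux session.

```shell
# Send a command to the Launcher window instead of running it locally.
# (dispatch_to_launcher is a hypothetical helper name.)
dispatch_to_launcher() {
  cmd=$1
  # Name of the tmux session this shell is running in.
  session=$(tmux display-message -p '#S') || return 1
  # Refuse to send keys if the session has no window named exactly "Launcher".
  if ! tmux list-windows -t "$session" -F '#{window_name}' | grep -qx 'Launcher'; then
    echo "no Launcher window in session $session" >&2
    return 1
  fi
  tmux send-keys -t "$session:Launcher" "$cmd" Enter
}
```

Calling `dispatch_to_launcher 'your command here'` types the command into the Launcher window, so the researcher sees exactly what was executed.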
Reference
Where to find things
- scripts/tmux.sh launches the run with a Launcher window in the named tmux session. The Claude window receives the output dir and session name in its appended prompt; if either is missing, ask rather than guess.
- {output_dir}/configs/: resolved TOMLs (rl.toml has the full picture).
- {output_dir}/logs/: see below.
- {output_dir}/rollouts/step_N/: saved rollouts.
Logs
{output_dir}/logs/
├── trainer.log # rank 0 stdout
├── orchestrator.log # orchestrator stdout
├── inference.log # vLLM stdout
├── trainer/
│ ├── node_*.log # per-node (multi-node only)
│ └── torchrun/ # per-rank stdout/stderr
├── inference/
│ ├── node_*.log # per-node (multi-node only)
│ └── router_0.log # vllm-router per replica (multi-node only)
└── envs/{train,eval}/{env_name}.log # one log file per env
Usually tailing trainer.log, orchestrator.log, and inference.log is enough. Drop into per-node or per-rank logs only when debugging. All logs are written by loguru in the format HH:mm:ss LEVEL message; the levels are DEBUG, INFO, SUCCESS, WARNING, and ERROR.
Scan for problems:
grep -E "WARNING|ERROR" {output_dir}/logs/{trainer,orchestrator,inference}.log
grep -E "WARNING|ERROR" {output_dir}/logs/envs/{train,eval}/*.log
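For a check-in summary, a per-file count is often easier to compare between visits than the raw matches. A minimal sketch, assuming the three top-level logs listed above; `scan_logs` is a hypothetical helper name.

```shell
# Print "<count> <file>" for each top-level log, counting WARNING/ERROR lines.
# (scan_logs is a hypothetical helper name.)
scan_logs() {
  log_dir=$1
  for name in trainer orchestrator inference; do
    f="$log_dir/$name.log"
    # Skip logs that do not exist yet (e.g. early in launch).
    [ -f "$f" ] || continue
    # grep -c prints 0 and exits non-zero when nothing matches; keep going.
    n=$(grep -cE 'WARNING|ERROR' "$f") || true
    printf '%s %s\n' "$n" "$f"
  done
}
```

Running `scan_logs {output_dir}/logs` at each check-in and comparing against the previous counts shows whether problems are growing.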
Metrics
All metrics print to the console log (and W&B when configured).
Progress — orchestrator log:
| Metric | Description |
|---|---|
| reward/{all,env}/mean | mean training reward |
| seq_len/{all,env}/mean | avg sequence length (tokens) |
| num_turns/{all,env}/mean | avg turns per rollout (multi-turn only) |
| is_truncated/{all,env}/mean | fraction truncated |
| empty_rollouts/{all,env}, errored_rollouts/{all,env} | fraction empty/errored |
| metrics/{env}/{metric} | env-specific (e.g. pass rate) |
| eval/{env}/{avg@k,pass@k} | eval scores when configured |
Stability — trainer log:
| Metric | Description |
|---|---|
| mismatch_kl/{all,env}/{mean,std,max} | KL between trainer and (old) inference policy over trainable tokens |
| entropy/{all,env}/{mean,std,max} | policy entropy over trainable tokens |
| masked_advantage_{positive,negative}/mean | fraction of DPPO-masked tokens with +/- advantage |
| optim/grad_norm | spikes may precede divergence |
Performance — trainer and orchestrator step independently, so comparing step times shows who's waiting on whom.
| Source | Metric | Description |
|---|---|---|
| trainer | time/step | total trainer step |
| trainer | time/wait_for_batch | high → orchestrator is bottleneck |
| trainer | time/forward_backward, time/broadcast_weights, time/save_ckpt | phase timings |
| trainer | perf/throughput, perf/mfu | tokens/s and MFU % |
| orchestrator | time/step, time/generate_completions, time/update_weights | phase timings |
| orchestrator | time/wait_for_ckpt | high → trainer is bottleneck |
| orchestrator | scheduler/async_level, scheduler/inflight_rollouts | scheduler state |
| env server | event loop lag (min/mean/p90/p99/max), active task distribution | periodic |
For live vLLM stats, query the inference server's Prometheus metrics endpoint directly:
curl -s http://localhost:8000/metrics | grep -E "num_requests|gpu_cache_usage"
# vllm:num_requests_running, vllm:num_requests_waiting, vllm:gpu_cache_usage_perc (→1.0 = KV cache saturated)
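To turn the KV-cache reading into a pass/fail check, the metrics text can be piped through a small filter. This is a sketch under assumptions: `kv_cache_check` and its 0.95 default threshold are hypothetical, and it assumes each sample line ends with the value (no trailing timestamp).

```shell
# Read Prometheus text-format metrics on stdin and print the highest
# vllm:gpu_cache_usage_perc sample. Exit status 1 means it reached the
# threshold; 2 means the metric was not present at all.
# (kv_cache_check and the 0.95 default threshold are hypothetical choices.)
kv_cache_check() {
  threshold=${1:-0.95}
  awk -v t="$threshold" '
    # Sample lines start with the metric name; "# HELP"/"# TYPE" lines do not.
    /^vllm:gpu_cache_usage_perc/ {
      v = $NF + 0
      if (!seen || v > max) max = v
      seen = 1
    }
    END {
      if (!seen) { print "metric not found"; exit 2 }
      printf "%.2f\n", max
      exit (max >= t + 0 ? 1 : 0)
    }'
}
```

Usage: `curl -s http://localhost:8000/metrics | kv_cache_check 0.9`. Taking the maximum covers the case where several series (for example one per model or engine label) are exported.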
Rollouts
{output_dir}/rollouts/step_N/
├── train_rollouts.jsonl # all train rollouts (vf.RolloutOutput, trajectory excluded)
├── eval_rollouts.jsonl # only present when eval ran
└── train_rollouts.bin # binary batch consumed by the trainer
wc -l {output_dir}/rollouts/step_42/train_rollouts.jsonl
head -1 {output_dir}/rollouts/step_42/train_rollouts.jsonl | python -m json.tool
jq '.reward' {output_dir}/rollouts/step_42/train_rollouts.jsonl
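The per-step `wc -l` check extends naturally to all saved steps, which makes a step with unusually few rollouts easy to spot. A minimal sketch; `rollout_counts` is a hypothetical helper name, and it relies only on the directory layout shown above.

```shell
# Print "<step> <num_train_rollouts>" for every saved step, in step order.
# (rollout_counts is a hypothetical helper name.)
rollout_counts() {
  rollouts_dir=$1
  for d in "$rollouts_dir"/step_*; do
    f="$d/train_rollouts.jsonl"
    [ -f "$f" ] || continue
    # Strip everything up to and including "step_" to get the step number.
    step=${d##*step_}
    # One JSON object per line, so the line count is the rollout count.
    printf '%s %s\n' "$step" "$(wc -l < "$f" | tr -d ' ')"
  done | sort -n
}
```

Usage: `rollout_counts {output_dir}/rollouts | tail -5` shows the most recent five steps.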
Common failure modes
A few warnings are normal. Escalate when errors are persistent, growing, or hit a large fraction of rollouts.
- Env workers: exceptions in env code, timeouts, sandbox errors, OOM kills (most common source — runs user code).
- Orchestrator: empty/errored rollout spikes, weight-broadcast failures, checkpoint errors.
- Trainer: NCCL/CUDA errors, OOM, NaN loss or gradients.
- Inference: NCCL/CUDA errors, OOM, request timeouts.
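One way to judge whether errors are "growing" is to bucket ERROR lines by hour. This is a sketch that assumes every line starts with `HH:mm:ss LEVEL`, as described in the Logs section, with no color codes; `errors_per_hour` is a hypothetical helper name.

```shell
# Count ERROR lines per hour so a growing error rate stands out.
# Assumes each line starts with "HH:mm:ss LEVEL" and contains no color codes.
# (errors_per_hour is a hypothetical helper name.)
errors_per_hour() {
  awk '$2 == "ERROR" && $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
         count[substr($1, 1, 2)]++     # key on the HH part of the timestamp
       }
       END { for (h in count) print h ":00", count[h] }' "$@" | sort
}
```

Usage: `errors_per_hour {output_dir}/logs/orchestrator.log`. Because the timestamps carry no date, runs longer than a day fold the same hour of different days into one bucket.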
Process tree
All processes use setproctitle so they're visible in ps/htop/pstree:
PRIME-RL::Launcher
├── PRIME-RL::Inference (vLLM server, GPU 0)
├── PRIME-RL::Orchestrator (CPU-only)
│ └── Verifiers::EnvServer (ZMQ env server per environment)
│ └── Verifiers::EnvWorker0..N
├── torchrun
│ └── PRIME-RL::Trainer (GPU 1+)
└── tail trainer.log
For multi-node runs, trainer and inference processes are on separate nodes — use srun or ssh to inspect them.
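Since the process titles are fixed, the liveness check can be sketched as a filter over `ps` output. `missing_procs` is a hypothetical helper; the titles it looks for are taken from the process tree above.

```shell
# Read `ps -eo args` output on stdin and report which expected process
# titles are missing. Exit status 1 means at least one is absent.
# (missing_procs is a hypothetical helper; titles come from the process tree.)
missing_procs() {
  ps_out=$(cat)
  status=0
  for title in PRIME-RL::Launcher PRIME-RL::Inference \
               PRIME-RL::Orchestrator PRIME-RL::Trainer; do
    # -F matches the title as a fixed string rather than a regex.
    if ! printf '%s\n' "$ps_out" | grep -qF -- "$title"; then
      echo "missing: $title"
      status=1
    fi
  done
  return "$status"
}
```

Usage: `ps -eo args | missing_procs`. On multi-node runs, a missing trainer or inference title on the local node is expected, so run the same check on the remote nodes via srun or ssh before concluding a process is down.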