monitor-run - SKILL.md Agent Skill

name: monitor-run description: Monitor an ongoing prime-rl training run — find the output directory, tail logs, check key metrics, inspect SLURM jobs, and restart safely. Use when asked to check on a run, debug training, or investigate performance.

Monitor a run

Runbook

On launch

Find the output dir and read the resolved configs at {output_dir}/configs/ (start with rl.toml).
Confirm all processes are alive and the run is making progress.
Write the initial summary into {output_dir}/STATUS.md.

Recurring check-ins

Default cadence: 1 hour (researcher can override). At each check-in:

Confirm processes are alive.
Grep logs for errors/warnings; note current step and key metrics.
Append an entry to {output_dir}/STATUS.md (never overwrite):

## YYYY-MM-DD HH:MM UTC

**Step**: {current_step} / {max_steps}
**Health**: {Healthy | Degraded | Down}

**Progress**: reward/mean, seq_len, truncation, eval scores, env-specific metrics.
**Stability**: entropy, mismatch_kl, grad_norm — flag spikes.
**Performance**: trainer vs orchestrator step time, env lag, inference pressure.

**Notes**: anything unusual (errors, restarts, hangs). Omit if nothing notable.

Restarting a run

Never restart unless the researcher explicitly asked. Confirm the exact restart command and the conditions that warrant one.

Never run kill or launch commands from your own shell. Dispatch them to the tmux Launcher window so the researcher sees what was executed:

SESSION=$(tmux display-message -p '#S')
tmux send-keys -t "$SESSION:Launcher" 'your command here' Enter

After a restart, verify all processes are back up and progress resumed before the next check-in.

Reference

Where to find things

scripts/tmux.sh launches the run with a Launcher window in the named tmux session. The Claude window receives the output dir and session name in its appended prompt — if either is missing, ask rather than guess.
{output_dir}/configs/ — resolved TOMLs (rl.toml has the full picture).
{output_dir}/logs/ — see below.
{output_dir}/rollouts/step_N/ — saved rollouts.

Logs

{output_dir}/logs/
├── trainer.log                # rank 0 stdout
├── orchestrator.log           # orchestrator stdout
├── inference.log              # vLLM stdout
├── trainer/
│   ├── node_*.log             # per-node (multi-node only)
│   └── torchrun/              # per-rank stdout/stderr
├── inference/
│   ├── node_*.log             # per-node (multi-node only)
│   └── router_0.log           # vllm-router per replica (multi-node only)
└── envs/{train,eval}/{env_name}.log    # one log file per env

Usually tailing trainer.log, orchestrator.log, and inference.log is enough. Drop into per-node or per-rank logs only when debugging. All logs are loguru with HH:mm:ss LEVEL message; levels: DEBUG, INFO, SUCCESS, WARNING, ERROR.

Scan for problems:

grep -E "WARNING|ERROR" {output_dir}/logs/{trainer,orchestrator,inference}.log
grep -E "WARNING|ERROR" {output_dir}/logs/envs/{train,eval}/*.log

Metrics

All metrics print to the console log (and W&B when configured).

Progress — orchestrator log:

Metric	Description
`reward/{all,env}/mean`	mean training reward
`seq_len/{all,env}/mean`	avg sequence length (tokens)
`num_turns/{all,env}/mean`	avg turns per rollout (multi-turn only)
`is_truncated/{all,env}/mean`	fraction truncated
`empty_rollouts/{all,env}`, `errored_rollouts/{all,env}`	fraction empty/errored
`metrics/{env}/{metric}`	env-specific (e.g. pass rate)
`eval/{env}/{avg@k,pass@k}`	eval scores when configured

Stability — trainer log:

Metric	Description
`mismatch_kl/{all,env}/{mean,std,max}`	KL between trainer and (old) inference policy over trainable tokens
`entropy/{all,env}/{mean,std,max}`	policy entropy over trainable tokens
`masked_advantage_{positive,negative}/mean`	fraction of DPPO-masked tokens with +/- advantage
`optim/grad_norm`	spikes may precede divergence

Performance — trainer and orchestrator step independently, so comparing step times shows who's waiting on whom.

Source	Metric	Description
trainer	`time/step`	total trainer step
trainer	`time/wait_for_batch`	high → orchestrator is bottleneck
trainer	`time/forward_backward`, `time/broadcast_weights`, `time/save_ckpt`	phase timings
trainer	`perf/throughput`, `perf/mfu`	tokens/s and MFU %
orchestrator	`time/step`, `time/generate_completions`, `time/update_weights`	phase timings
orchestrator	`time/wait_for_ckpt`	high → trainer is bottleneck
orchestrator	`scheduler/async_level`, `scheduler/inflight_rollouts`	scheduler state
env server	event loop lag (min/mean/p90/p99/max), active task distribution	periodic

For live vLLM stats, query Prometheus directly:

curl -s http://localhost:8000/metrics | grep -E "num_requests|gpu_cache_usage"
# vllm:num_requests_running, vllm:num_requests_waiting, vllm:gpu_cache_usage_perc (→1.0 = KV cache saturated)

Rollouts

{output_dir}/rollouts/step_N/
├── train_rollouts.jsonl   # all train rollouts (vf.RolloutOutput, trajectory excluded)
├── eval_rollouts.jsonl    # only present when eval ran
└── train_rollouts.bin     # binary batch consumed by the trainer

wc -l {output_dir}/rollouts/step_42/train_rollouts.jsonl
head -1 {output_dir}/rollouts/step_42/train_rollouts.jsonl | python -m json.tool
jq '.reward' {output_dir}/rollouts/step_42/train_rollouts.jsonl

Common failure modes

A few warnings are normal. Escalate when errors are persistent, growing, or hit a large fraction of rollouts.

Env workers: exceptions in env code, timeouts, sandbox errors, OOM kills (most common source — runs user code).
Orchestrator: empty/errored rollout spikes, weight-broadcast failures, checkpoint errors.
Trainer: NCCL/CUDA errors, OOM, NaN loss or gradients.
Inference: NCCL/CUDA errors, OOM, request timeouts.

Process tree

All processes use setproctitle so they're visible in ps/htop/pstree:

PRIME-RL::Launcher
├── PRIME-RL::Inference          (vLLM server, GPU 0)
├── PRIME-RL::Orchestrator       (CPU-only)
│   └── Verifiers::EnvServer     (ZMQ env server per environment)
│       └── Verifiers::EnvWorker0..N
├── torchrun
│   └── PRIME-RL::Trainer        (GPU 1+)
└── tail trainer.log

For multi-node runs, trainer and inference processes are on separate nodes — use srun or ssh to inspect them.