monitor-run

star 1.5k

Monitor an ongoing prime-rl training run — find the output directory, tail logs, check key metrics, inspect SLURM jobs, and restart safely. Use when asked to check on a run, debug training, or investigate performance.

PrimeIntellect-ai By PrimeIntellect-ai schedule Updated 5/20/2026

name: monitor-run description: Monitor an ongoing prime-rl training run — find the output directory, tail logs, check key metrics, inspect SLURM jobs, and restart safely. Use when asked to check on a run, debug training, or investigate performance.

Monitor a run

Runbook

On launch

  1. Find the output dir and read the resolved configs at {output_dir}/configs/ (start with rl.toml).
  2. Confirm all processes are alive and the run is making progress.
  3. Write the initial summary into {output_dir}/STATUS.md.

Recurring check-ins

Default cadence: 1 hour (researcher can override). At each check-in:

  1. Confirm processes are alive.
  2. Grep logs for errors/warnings; note current step and key metrics.
  3. Append an entry to {output_dir}/STATUS.md (never overwrite):
## YYYY-MM-DD HH:MM UTC

**Step**: {current_step} / {max_steps}
**Health**: {Healthy | Degraded | Down}

**Progress**: reward/mean, seq_len, truncation, eval scores, env-specific metrics.
**Stability**: entropy, mismatch_kl, grad_norm — flag spikes.
**Performance**: trainer vs orchestrator step time, env lag, inference pressure.

**Notes**: anything unusual (errors, restarts, hangs). Omit if nothing notable.

Restarting a run

Never restart unless the researcher explicitly asked. Confirm the exact restart command and the conditions that warrant one.

Never run kill or launch commands from your own shell. Dispatch them to the tmux Launcher window so the researcher sees what was executed:

SESSION=$(tmux display-message -p '#S')
tmux send-keys -t "$SESSION:Launcher" 'your command here' Enter

After a restart, verify all processes are back up and progress resumed before the next check-in.


Reference

Where to find things

  • scripts/tmux.sh launches the run with a Launcher window in the named tmux session. The Claude window receives the output dir and session name in its appended prompt — if either is missing, ask rather than guess.
  • {output_dir}/configs/ — resolved TOMLs (rl.toml has the full picture).
  • {output_dir}/logs/ — see below.
  • {output_dir}/rollouts/step_N/ — saved rollouts.

Logs

{output_dir}/logs/
├── trainer.log                # rank 0 stdout
├── orchestrator.log           # orchestrator stdout
├── inference.log              # vLLM stdout
├── trainer/
│   ├── node_*.log             # per-node (multi-node only)
│   └── torchrun/              # per-rank stdout/stderr
├── inference/
│   ├── node_*.log             # per-node (multi-node only)
│   └── router_0.log           # vllm-router per replica (multi-node only)
└── envs/{train,eval}/{env_name}.log    # one log file per env

Usually tailing trainer.log, orchestrator.log, and inference.log is enough. Drop into per-node or per-rank logs only when debugging. All logs are loguru with HH:mm:ss LEVEL message; levels: DEBUG, INFO, SUCCESS, WARNING, ERROR.

Scan for problems:

grep -E "WARNING|ERROR" {output_dir}/logs/{trainer,orchestrator,inference}.log
grep -E "WARNING|ERROR" {output_dir}/logs/envs/{train,eval}/*.log

Metrics

All metrics print to the console log (and W&B when configured).

Progress — orchestrator log:

Metric Description
reward/{all,env}/mean mean training reward
seq_len/{all,env}/mean avg sequence length (tokens)
num_turns/{all,env}/mean avg turns per rollout (multi-turn only)
is_truncated/{all,env}/mean fraction truncated
empty_rollouts/{all,env}, errored_rollouts/{all,env} fraction empty/errored
metrics/{env}/{metric} env-specific (e.g. pass rate)
eval/{env}/{avg@k,pass@k} eval scores when configured

Stability — trainer log:

Metric Description
mismatch_kl/{all,env}/{mean,std,max} KL between trainer and (old) inference policy over trainable tokens
entropy/{all,env}/{mean,std,max} policy entropy over trainable tokens
masked_advantage_{positive,negative}/mean fraction of DPPO-masked tokens with +/- advantage
optim/grad_norm spikes may precede divergence

Performance — trainer and orchestrator step independently, so comparing step times shows who's waiting on whom.

Source Metric Description
trainer time/step total trainer step
trainer time/wait_for_batch high → orchestrator is bottleneck
trainer time/forward_backward, time/broadcast_weights, time/save_ckpt phase timings
trainer perf/throughput, perf/mfu tokens/s and MFU %
orchestrator time/step, time/generate_completions, time/update_weights phase timings
orchestrator time/wait_for_ckpt high → trainer is bottleneck
orchestrator scheduler/async_level, scheduler/inflight_rollouts scheduler state
env server event loop lag (min/mean/p90/p99/max), active task distribution periodic

For live vLLM stats, query Prometheus directly:

curl -s http://localhost:8000/metrics | grep -E "num_requests|gpu_cache_usage"
# vllm:num_requests_running, vllm:num_requests_waiting, vllm:gpu_cache_usage_perc (→1.0 = KV cache saturated)

Rollouts

{output_dir}/rollouts/step_N/
├── train_rollouts.jsonl   # all train rollouts (vf.RolloutOutput, trajectory excluded)
├── eval_rollouts.jsonl    # only present when eval ran
└── train_rollouts.bin     # binary batch consumed by the trainer
wc -l {output_dir}/rollouts/step_42/train_rollouts.jsonl
head -1 {output_dir}/rollouts/step_42/train_rollouts.jsonl | python -m json.tool
jq '.reward' {output_dir}/rollouts/step_42/train_rollouts.jsonl

Common failure modes

A few warnings are normal. Escalate when errors are persistent, growing, or hit a large fraction of rollouts.

  • Env workers: exceptions in env code, timeouts, sandbox errors, OOM kills (most common source — runs user code).
  • Orchestrator: empty/errored rollout spikes, weight-broadcast failures, checkpoint errors.
  • Trainer: NCCL/CUDA errors, OOM, NaN loss or gradients.
  • Inference: NCCL/CUDA errors, OOM, request timeouts.

Process tree

All processes use setproctitle so they're visible in ps/htop/pstree:

PRIME-RL::Launcher
├── PRIME-RL::Inference          (vLLM server, GPU 0)
├── PRIME-RL::Orchestrator       (CPU-only)
│   └── Verifiers::EnvServer     (ZMQ env server per environment)
│       └── Verifiers::EnvWorker0..N
├── torchrun
│   └── PRIME-RL::Trainer        (GPU 1+)
└── tail trainer.log

For multi-node runs, trainer and inference processes are on separate nodes — use srun or ssh to inspect them.

Install via CLI
npx skills add https://github.com/PrimeIntellect-ai/prime-rl --skill monitor-run
Repository Details
star Stars 1,484
call_split Forks 313
navigation Branch main
article Path SKILL.md
More from Creator
PrimeIntellect-ai
PrimeIntellect-ai Explore all skills →