name: run-rlhf-code-experiment
description: Plan, run, and report a small RLHF Book code experiment.
allowed-tools: Bash(uv:), Bash(git:), Read, Edit
Run RLHF Code Experiment
Use this skill when the user wants to run, adapt, compare, or document an experiment from code/.
Pick The Starting Point
- Policy gradients / RL / GRPO / PPO: read `code/policy_gradients/README.md`.
- Reward models / ORM / PRM / Bradley-Terry RM: read `code/reward_models/README.md`.
- DPO / IPO / SimPO / ORPO / KTO / APO: read `code/direct_alignment/README.md`.
- Rejection sampling / best-of-N / GSM8K filtering: read `code/rejection_sampling/README.md`.
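The routing above can be read as a small lookup table. The following is a minimal sketch, assuming a hypothetical helper named `matching_readmes` and simple whole-word keyword matching. Only the README paths come from the list above; the keyword spellings and the matching rule are illustrative.

```python
import re

# Hypothetical lookup: README path -> topic keywords that point to it.
# Paths are the ones listed above; keywords and matching are assumptions.
STARTING_POINTS = {
    "code/policy_gradients/README.md": ("policy gradient", "rl", "grpo", "ppo"),
    "code/reward_models/README.md": ("reward model", "orm", "prm", "bradley-terry"),
    "code/direct_alignment/README.md": ("dpo", "ipo", "simpo", "orpo", "kto", "apo"),
    "code/rejection_sampling/README.md": ("rejection sampling", "best-of-n", "gsm8k"),
}


def matching_readmes(request):
    """Return every README whose keywords appear as whole words in `request`."""
    text = request.lower()
    matches = []
    for readme, keywords in STARTING_POINTS.items():
        # Allow an optional plural "s" so "policy gradients" matches "policy gradient".
        if any(re.search(r"\b" + re.escape(kw) + r"s?\b", text) for kw in keywords):
            matches.append(readme)
    return matches
```

Returning a list rather than a single path keeps ambiguous requests visible: if more than one README matches, read each of them before choosing a command.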
Run Protocol
- Work from the repository root unless a command explicitly says `cd code/`.
- Install or refresh dependencies with `cd code/ && uv sync` only when needed.
- Use `uv run python`, never bare `python`.
- Start with a short run:
  - Reward models: lower `--samples` and `--epochs`.
  - Direct alignment: use `--max_samples` or copy a YAML with a smaller sample count.
  - Policy gradients: copy a YAML and reduce `data.size` before changing algorithm logic.
  - Rejection sampling: reduce `max_train_samples`, `max_test_samples`, or `num_completions_per_prompt` in a copied YAML.
- For any long training, preprocessing, evaluation, or sweep command, launch the command in the background rather than the foreground. In Claude Code, use the background-run option for the shell command, then start a monitor for it.
- Watch the monitor until the run has produced initial logs or failed. The Claude Code status bar should show a background task and monitor (for example, `[1 background task] [1 monitor]`). Keep checking the monitor periodically for loss, metrics, W&B URLs, OOMs, dataset download errors, and stalled output.
- Run one training job at a time unless GPU memory has been checked.
- If W&B is not desired, set `WANDB_MODE=disabled` or use the module's no-W&B flag when available.
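Inside Claude Code, the background-run option and its monitor handle the launch-and-watch steps. For readers who want to see the same pattern outside that tool, here is a minimal Python sketch. The helper names `launch_in_background` and `check_once`, the log path, and the list of failure markers are all assumptions for illustration; the demo command is a stand-in, not a command from this repository.

```python
import os
import subprocess
import sys
import tempfile
import time
from pathlib import Path

# Substrings treated as failure signs. This list is an assumption, not a
# complete catalogue of what a training run can print.
FAILURE_MARKERS = ("out of memory", "Traceback", "ConnectionError")


def launch_in_background(command, log_path, disable_wandb=True):
    """Start `command` without blocking; send stdout and stderr to `log_path`."""
    env = dict(os.environ)
    if disable_wandb:
        env["WANDB_MODE"] = "disabled"  # variable named in the protocol above
    with open(log_path, "w") as log_file:
        # The child process keeps its own handle to the log after we close ours.
        return subprocess.Popen(
            command, stdout=log_file, stderr=subprocess.STDOUT, env=env
        )


def check_once(process, log_path):
    """One monitoring pass: return (state, log lines that look like failures)."""
    text = Path(log_path).read_text(errors="replace")
    failures = [
        line for line in text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]
    return_code = process.poll()  # None while the process is still running
    if return_code is None:
        state = "running"
    elif return_code == 0:
        state = "finished"
    else:
        state = "failed"
    return state, failures


if __name__ == "__main__":
    # Stand-in command; substitute the module's own `uv run python ...`
    # invocation from its README.
    demo = [sys.executable, "-c", "print('placeholder training output')"]
    log_path = os.path.join(tempfile.mkdtemp(), "run.log")
    proc = launch_in_background(demo, log_path)
    state, failures = check_once(proc, log_path)
    while state == "running":
        time.sleep(0.5)
        state, failures = check_once(proc, log_path)
    print(state, failures)
```

Polling with `check_once` rather than blocking on the process mirrors the protocol: the run stays in the background, and each pass looks for both a changed process state and suspicious log lines. Detecting stalled output would additionally need a timestamp of the last log change, which this sketch omits.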
What To Report
Report enough detail for another reader to reproduce the result:
- Exact command.
- Model, dataset, seed, and config file.
- Config values changed from the checked-in defaults.
- Final metrics and any observed failure mode.
- W&B run URL if logging was enabled.
- Follow-up sweep worth trying next.
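One way to keep reports consistent is to fill in a fixed template. The sketch below is an assumption about how such a template could look, not a utility that exists in the repository: `RunReport` and `changed_values` are hypothetical names, and the field list simply mirrors the bullets above. `changed_values` compares a run's config with the checked-in defaults and reports differences under dotted keys such as `data.size`.

```python
from dataclasses import dataclass, field


def changed_values(defaults, run_config, prefix=""):
    """Return {dotted_key: (default, new)} for every value that differs."""
    changes = {}
    for key in sorted(set(defaults) | set(run_config)):
        path = f"{prefix}{key}"
        old, new = defaults.get(key), run_config.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested sections so keys come out as "data.size".
            changes.update(changed_values(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes


@dataclass
class RunReport:
    """Hypothetical report template; one field per item in the list above."""
    command: str
    model: str
    dataset: str
    seed: int
    config_file: str
    changes: dict = field(default_factory=dict)
    final_metrics: dict = field(default_factory=dict)
    failure_mode: str = "none observed"
    wandb_url: str = "logging disabled"
    follow_up: str = "not yet decided"

    def render(self):
        lines = [
            f"- Command: {self.command}",
            f"- Model: {self.model}",
            f"- Dataset: {self.dataset}",
            f"- Seed: {self.seed}",
            f"- Config file: {self.config_file}",
            "- Changed from defaults:",
        ]
        for key, (old, new) in self.changes.items():
            lines.append(f"  - {key}: {old} -> {new}")
        lines.append("- Final metrics:")
        for name, value in self.final_metrics.items():
            lines.append(f"  - {name}: {value}")
        lines.append(f"- Failure mode: {self.failure_mode}")
        lines.append(f"- W&B run: {self.wandb_url}")
        lines.append(f"- Follow-up sweep: {self.follow_up}")
        return "\n".join(lines)
```

Computing the changed values from the two configs, instead of typing them by hand, makes it harder to forget an override that was made early in a session.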
Comparison Rules
- For policy gradients, compare `avg_correctness`, `avg_format`, `avg_binary`, loss, and whether sampled groups contain reward contrast.
- For reward models, compare reward margins or correctness scores on held-out examples, not just training loss.
- For direct alignment, compare `accuracy`, `margins`, `chosen_rewards`, `rejected_rewards`, and sample generations. IPO loss scale is not directly comparable to DPO loss scale.
- For rejection sampling, always compare each reward-selected run to its matched random baseline.
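The "reward contrast" check for policy gradients matters because group-relative methods such as GRPO normalize rewards within each sampled group: if every completion in a group receives the same reward, the normalized advantages are all zero and that group contributes no learning signal. The sketch below illustrates the idea with plain Python lists. The function names, the tolerance, and the exact normalization (mean and population standard deviation plus a small epsilon) are assumptions; the repository's implementation may differ.

```python
def contrast_fraction(groups, tolerance=1e-8):
    """Fraction of groups whose rewards are not all (nearly) equal.

    `groups` is a list of non-empty reward lists, one list per prompt.
    """
    if not groups:
        return 0.0
    with_contrast = sum(
        1 for rewards in groups if max(rewards) - min(rewards) > tolerance
    )
    return with_contrast / len(groups)


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

A low `contrast_fraction` alongside a flat loss suggests the prompts are either too easy or too hard for the current policy, which is worth noting in the report before changing algorithm logic.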
Documentation Rule
If the run exposes a new setup requirement, failure mode, or useful workflow shortcut, update the relevant README, code/CLAUDE.md, or this skill before finishing.