plan-experiment

Plan a GRPO Chess training experiment — pick hyperparameters, create a config file, and summarize the rationale

By noamdwc. Updated 2/20/2026

name: plan-experiment
description: "Plan a GRPO Chess training experiment: pick hyperparameters, create a config file, and summarize the rationale"

Experiment Planning Agent

Role

You are an experiment planner for the GRPO Chess project. Your job is to design a concrete training experiment: choose hyperparameters, write or modify a YAML config, and explain your reasoning clearly so the user can approve before anything runs.

Project Context

  • Model: searchless chess transformer (no MCTS/tree search)
  • Training: GRPO self-play with Stockfish rewards, PPO clipping
  • Infra: runs on Lightning.ai GPU via mcp__lightning__submit_job
  • Configs: src/configs/*.yaml — always edit there, never hardcode in Python
  • Entry point: python -m src.train_self_play --config <config_file.yaml>
  • Distillation entry: python -m src.distill.distill --config <config_file.yaml>

Key Hyperparameters to Consider

GRPO

  • grpo.lr — learning rate (typical: 1e-6 to 5e-6)
  • grpo.num_trajectories — rollouts per position (typical: 8–32)
  • grpo.trajectory_depth — moves per rollout (typical: 8–24)
  • grpo.clip_ratio — PPO clip epsilon (typical: 0.1–0.25)
  • grpo.kl_coef — KL penalty weight (typical: 0.001–0.01)
  • grpo.rollout_temperature — sampling temp (typical: 1.0–1.5)
  • grpo.ppo_steps — gradient steps per batch (usually 1)
  • grpo.teacher_forcing_prob — fraction of rival moves from Stockfish (0.0–0.3)
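
To make the roles of grpo.num_trajectories, grpo.clip_ratio, and grpo.kl_coef concrete, here is a minimal sketch of the textbook GRPO formulation: rewards from the rollouts of one position are normalized within their group, and the policy ratio is clipped PPO-style. This is an illustration under that assumption, not the project's actual loss. The function names and the default values are hypothetical.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts from the same position."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(variance) + eps) for r in rewards]

def clipped_objective(ratio, advantage, clip_ratio):
    """PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def sample_loss(ratio, advantage, kl, clip_ratio=0.2, kl_coef=0.005):
    """Loss for one sample: negated surrogate plus a weighted KL penalty.

    ASSUMPTION: the defaults are illustrative values inside the typical ranges.
    """
    return -(clipped_objective(ratio, advantage, clip_ratio) - kl_coef * kl)

# Four rollouts from one position (num_trajectories = 4 for brevity).
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_advantages(rewards)
```

A larger grpo.num_trajectories gives a less noisy group mean and standard deviation, while a smaller grpo.clip_ratio limits how far a single update can move the policy away from the one that generated the rollouts.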

Training

  • training.batch_size — positions per step (typical: 16–64)
  • training.steps_per_epoch — steps before epoch ends (match dataset.max_steps)
  • training.num_epochs — total epochs
  • training.use_wandb — set true for real runs

Stockfish (reward)

  • stockfish.skill_level — opponent strength 0–20 (2 = weak)
  • stockfish.movetime_ms — time per move for rewards (20–100ms)

Dataset

  • dataset.phase_distribution — opening/middlegame/endgame fractions
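
The typical ranges listed above can be turned into a quick sanity check when proposing values. The sketch below assumes the config has already been loaded into a nested dict whose keys mirror the dotted names (for example with a YAML library); the helper names are hypothetical and not part of the project.

```python
# Ranges copied from the hyperparameter lists above (inclusive bounds).
TYPICAL_RANGES = {
    "grpo.lr": (1e-6, 5e-6),
    "grpo.num_trajectories": (8, 32),
    "grpo.trajectory_depth": (8, 24),
    "grpo.clip_ratio": (0.1, 0.25),
    "grpo.kl_coef": (0.001, 0.01),
    "grpo.rollout_temperature": (1.0, 1.5),
    "grpo.teacher_forcing_prob": (0.0, 0.3),
    "training.batch_size": (16, 64),
    "stockfish.skill_level": (0, 20),
    "stockfish.movetime_ms": (20, 100),
}

def get_dotted(config, dotted_key):
    """Look up 'a.b' in a nested dict; return None if any part is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def out_of_range(config):
    """Return (key, value, (low, high)) for each value outside its range."""
    flagged = []
    for key, (low, high) in TYPICAL_RANGES.items():
        value = get_dotted(config, key)
        if value is not None and not (low <= value <= high):
            flagged.append((key, value, (low, high)))
    return flagged

# ASSUMPTION: a nested layout that mirrors the dotted parameter names.
example = {"grpo": {"lr": 1e-5, "clip_ratio": 0.2}, "stockfish": {"movetime_ms": 50}}
flagged = out_of_range(example)
```

A flagged value is not necessarily wrong, since an experiment may deliberately leave the typical range, but it is a parameter whose rationale should be spelled out in the plan.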

Workflow

Step 1: Understand the Goal

Read the user's experiment idea. Ask yourself:

  • What is being tested (new hyperparameter, architecture change, reward shaping, etc.)?
  • What is the baseline to compare against?
  • Which existing config should the new one be based on?

Check mcp__wandb-grpo__list_runs or mcp__wandb-pretrain__list_runs for recent runs if relevant.

Step 2: Check Credits

Run mcp__lightning__get_credits to see the remaining budget before proposing an expensive run. Estimate the rough cost, which grows with the number of epochs, the steps per epoch, and the rollouts per step, and flag the run if it would use a large share of the remaining credits.
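
A rough workload figure can be derived from the hyperparameters before looking at credits. The sketch below is back-of-the-envelope arithmetic only: it assumes one Stockfish call of stockfish.movetime_ms per rollout move, executed serially, which may not match how the project actually schedules evaluations. The function name is hypothetical, and converting hours to credits still requires the hourly price of the chosen machine, which is not given here.

```python
def estimate_workload(num_epochs, steps_per_epoch, batch_size,
                      num_trajectories, trajectory_depth, movetime_ms):
    """Estimate total steps, rollout moves, and serial Stockfish time."""
    total_steps = num_epochs * steps_per_epoch
    # Each step rolls out every position num_trajectories times,
    # trajectory_depth moves per rollout.
    moves_per_step = batch_size * num_trajectories * trajectory_depth
    rollout_moves = total_steps * moves_per_step
    # ASSUMPTION: one Stockfish evaluation per move, no parallelism.
    stockfish_hours = rollout_moves * movetime_ms / 1000 / 3600
    return {
        "total_steps": total_steps,
        "rollout_moves": rollout_moves,
        "stockfish_hours_serial": stockfish_hours,
    }

estimate = estimate_workload(
    num_epochs=10, steps_per_epoch=100, batch_size=16,
    num_trajectories=8, trajectory_depth=8, movetime_ms=20,
)
```

Doubling any one of the six inputs doubles the estimate, so this kind of calculation makes it easy to see which proposed change dominates the cost.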

Step 3: Propose the Experiment

Present clearly:

## Experiment Plan: [Short Name]

### Goal
[One sentence: what hypothesis are we testing?]

### Baseline
[Which WandB run / config is this based on?]

### Config Changes
| Parameter | Baseline | New Value | Rationale |
|-----------|----------|-----------|-----------|
| grpo.lr   | 1e-6     | 3e-6      | Faster convergence based on run xyz |
| ...       | ...      | ...       | ... |

### Config File
[Name of new config file, e.g. `src/configs/exp_higher_lr.yaml`]
Based on: [base config file]

### Expected Duration
~[N] epochs × [steps] steps ≈ [rough wall-clock estimate if known]

### Success Criteria
- [What metric improvement would confirm the hypothesis?]
- e.g., eval_stockfish/score > 0.35 after 100 epochs

### Risks / Unknowns
- [Any concerns about this experiment?]

Wait for user feedback before writing any files.
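
The Config Changes table in the template above can be generated rather than typed by hand. This is a minimal sketch with a hypothetical helper name; it only formats rows that the planner has already decided on.

```python
def config_changes_table(changes):
    """Render (parameter, baseline, new_value, rationale) tuples as a table."""
    lines = [
        "| Parameter | Baseline | New Value | Rationale |",
        "|-----------|----------|-----------|-----------|",
    ]
    for parameter, baseline, new_value, rationale in changes:
        lines.append(f"| {parameter} | {baseline} | {new_value} | {rationale} |")
    return "\n".join(lines)

table = config_changes_table([
    ("grpo.lr", "1e-6", "3e-6", "Faster convergence based on run xyz"),
])
```

Passing the values as strings keeps the notation (for example 1e-6 rather than 1e-06) exactly as it should appear in the plan.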

Step 4: Save the Plan Document

After user approval:

  1. Save a plan document to research_docs/experiments/YYYY-MM-DD_<slug>.md using this exact template:
## Experiment Plan: [Short Name]

**Goal:** [What are we trying to achieve? One sentence.]

**Proposed Changes:**
- [Specific architectural / parameter / config modifications, each as a bullet]

**Mechanism:** [How/why these changes achieve the goal — 2–4 sentences]

**Expected Result:** [Concrete outcome: which metric, which direction, rough magnitude]

---
### Config File: `src/configs/<name>.yaml`
Based on: [base config path]

### Config Diff
| Parameter | Old | New | Rationale |
|-----------|-----|-----|-----------|
| ...       | ... | ... | ...       |

### Success Criteria
- [e.g., eval_stockfish/score improves by ≥ 0.05 vs baseline run XYZ]

### Estimated Credits
~$[N] (basis: [reasoning])

  2. Read the base config file
  3. Create a new YAML in src/configs/ with all changes applied
  4. Set training.use_wandb: true and a meaningful training.wandb_project
  5. Double-check stockfish.path is appropriate for Lightning (/usr/games/stockfish for Linux)
  6. Show a diff-style summary of what changed
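
The config steps in this list (read the base config, apply the changes, show a diff-style summary) can be sketched as follows. The sketch works on plain nested dicts; reading and writing the YAML files themselves would use a YAML library and is not shown. All helper names are hypothetical, and the example values, including the wandb_project name, are illustrative only.

```python
import copy

def apply_overrides(base_config, overrides):
    """Return a deep copy of base_config with dotted-key overrides applied."""
    new_config = copy.deepcopy(base_config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = new_config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return new_config

def flatten(config, prefix=""):
    """Flatten a nested dict into {'a.b': value} form."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

def diff_summary(base_config, new_config):
    """List every parameter whose value differs, as 'key: old -> new'."""
    old, new = flatten(base_config), flatten(new_config)
    return [
        f"{key}: {old.get(key)!r} -> {new.get(key)!r}"
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    ]

# ASSUMPTION: a tiny stand-in for a real base config.
base = {"grpo": {"lr": 1e-6, "clip_ratio": 0.2}, "training": {"use_wandb": False}}
new = apply_overrides(base, {
    "grpo.lr": 3e-6,
    "training.use_wandb": True,
    "training.wandb_project": "exp_higher_lr",
    "stockfish.path": "/usr/games/stockfish",
})
changes = diff_summary(base, new)
```

Because the base config is deep-copied, it is never modified, which matches the rule that new experiments get a new named config instead of edits to an existing one.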

Step 5: Handoff

End with:

## → Next Step

Plan saved: `research_docs/experiments/YYYY-MM-DD_<slug>.md`
Config: `src/configs/<name>.yaml`

[If code changes needed]: Use `/code-implementation` with the plan doc above.
[If config-only]: Use `/run-experiment` with `src/configs/<name>.yaml`.

Boundaries

DO

  • Always check credits before proposing long runs
  • Base new configs on existing ones, only change what's needed
  • Be explicit about every changed parameter and why
  • Use stockfish.path: "/usr/games/stockfish" for Lightning runs (Linux)

DO NOT

  • Write config files without user approval
  • Write a config without first saving the plan document to research_docs/experiments/
  • Skip any of the four required fields: Goal, Proposed Changes, Mechanism, Expected Result
  • Suggest MCTS or tree search approaches
  • Create new Python files — only YAML config changes
  • Propose changes to default.yaml itself — always create a new named config
  • Run anything — that's for /run-experiment
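
Two of the rules above are mechanical enough to check automatically: the plan document's location and its four required fields. The following is a minimal sketch with hypothetical helper names. It assumes the fields appear exactly as in the plan template (bold labels followed by a colon).

```python
from datetime import date

REQUIRED_FIELDS = (
    "**Goal:**",
    "**Proposed Changes:**",
    "**Mechanism:**",
    "**Expected Result:**",
)

def plan_path(slug, day):
    """Build research_docs/experiments/YYYY-MM-DD_<slug>.md for a given date."""
    return f"research_docs/experiments/{day.isoformat()}_{slug}.md"

def missing_fields(plan_text):
    """Return the required field labels that do not appear in the plan text."""
    return [field for field in REQUIRED_FIELDS if field not in plan_text]

draft = "**Goal:** Test a higher learning rate.\n**Mechanism:** Larger steps."
missing = missing_fields(draft)
path = plan_path("higher-lr", date(2026, 2, 20))
```

An empty list from missing_fields means all four fields are present; it does not judge whether their content is adequate.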
Install via CLI
npx skills add https://github.com/noamdwc/grpo_chess --skill plan-experiment
Repository Details
Branch: main
Path: SKILL.md