name: evolve license: MIT description: >- Research-driven multi-cycle improvement director. Forms causal hypotheses about why scores are low, validates them with scout agents before attacking, dispatches axis-parallel fleet attacks, extracts transferable patterns, and runs indefinitely within a budget envelope. Accumulates a persistent belief model and pattern library across sessions. user-invocable: true auto-trigger: false trigger_keywords: - evolve - sustained improve - improvement director - research-driven improve - multi-cycle improve - run until done - improve until ceiling - keep improving - hypothesis - belief model - scout agents last-updated: 2026-05-03
/evolve — Improvement Director
Orientation
Use when: You want sustained autonomous quality advancement — the director forms hypotheses, scouts before attacking, and builds a belief model that compounds across cycles. Runs until a natural ceiling, budget exhaustion, or you say stop.
Don't use when: You want a single scored loop (/improve), a known axis
attacked directly (/improve --axis), or a one-time audit (/improve --score-only).
Key difference from /improve: /improve follows the rubric mechanically.
/evolve asks why scores are where they are, validates those theories before
spending fleet budget, and extracts cross-skill patterns that propagate to skills
never directly attacked.
Invocation
/evolve {target} # run until ceiling, velocity drop, or budget
/evolve {target} --n={N} # exactly N director cycles then stop
/evolve {target} --budget=${X} # run until cumulative spend reaches $X
/evolve {target} --continue # resume from saved director state
/evolve {target} --status # show belief model, velocity, spend — no attack
/evolve {target} --axis={name} # focus director on one axis (scout + attack only)
target maps to .planning/rubrics/{target}.md.
If no rubric exists, run /improve {target} Phase 0 first — /evolve requires
an approved rubric and will not auto-generate one.
Campaign Artifacts
All findings are externalized incrementally — written after every phase, not only at cycle end. A crashed or compacted session resumes with full context.
| Artifact | Path | Contents |
|---|---|---|
| Director state | .planning/evolve/{target}/director-state.json |
cycle count, spend, velocity history, current phase, halt status |
| Belief model | .planning/evolve/{target}/belief-model.jsonl |
one record per (axis, skill) per cycle: score, hypothesis, evidence, confidence |
| Experiment log | .planning/evolve/{target}/experiment-log.jsonl |
every experiment: hypothesis → prediction → actual delta → mechanism confirmed |
| Pattern library | .planning/evolve/{target}/pattern-library.md |
transferable patterns: what change to what axis class caused what delta in which skills |
| Cycle digest | .planning/evolve/{target}/cycle-{n}-digest.md |
human-readable per-cycle summary for review |
| Global patterns | .planning/research/patterns.md |
cross-target patterns written outside campaign scope; available to future sessions and other targets |
| Knowledge wiki | .planning/wiki/ |
compiled wiki pages from /learn; integrates evolve discoveries across sessions |
Create .planning/evolve/{target}/ on first invocation. Create .planning/research/ if absent.
Cycle digest contents: scores table (axis, prior, this cycle, delta), hypotheses table (id, axis, hypothesis, scout result, confidence), what was attacked (axis, skill, delta, mechanism confirmed), patterns discovered this cycle, belief model updates, and the spend/velocity line. Full template: docs/QUALITY_LOOPS.md#cycle-digest-format.
Director Cycle Protocol
Phase 1: Survey
Run /improve {target} --score-only. Record scores to belief model with delta
from prior cycle (empty on cycle 1). Flag any axis that dropped since last cycle
as regression-watch — these are checked first in Phase 2.
Phase 2: Hypothesize
For every axis below 8.0, generate one primary hypothesis in this form:
HYPOTHESIS: {axis} scores {n}/10 because {specific mechanism},
not because {common misread}.
PREDICTION: Fixing {mechanism} will raise score ≥ {delta} across {N} skills.
FALSIFICATION: If we apply {change} and score does not rise > 0.5, hypothesis rejected.
Draw hypotheses from: evaluator justifications in Phase 1, prior evidence in the belief model, and programmatic check failures. Do not hypothesize from score alone — the number is the symptom.
Write each hypothesis to the experiment log as { id, status: "pending", ... }.
Skip hypothesis generation for an axis if the belief model already has a
confidence >= 0.8 confirmed hypothesis for it that has not yet been attacked.
Phase 3: Scout
For axes below 7.0, or axes with unconfirmed hypotheses: dispatch one scout agent per hypothesis. Scouts read — they do not modify files.
Each scout returns { "hypothesis_id", "confirmed", "evidence", "confidence" }
(schema example: docs/QUALITY_LOOPS.md#scout-result-schema).
Scout confidence protocol: Scouts read relevant files only — no edits, no test runs. Assign confidence:
- 0.9+: mechanism is directly observable (explicit absence, missing section, wrong value in file)
- 0.7–0.89: strong indirect evidence from 2+ corroborating observations
- 0.4–0.69: single observation that supports the hypothesis; alternative explanations plausible
- < 0.4: no direct evidence found; hypothesis is speculative from this file set
Run scouts in parallel. Update experiment log:
confidence >= 0.7→confirmedconfidence 0.4–0.69→needs-evidence(do not attack; add to next cycle)confidence < 0.4→rejected
Skip Phase 3 for any hypothesis already confirmed at confidence >= 0.8 in
the belief model from a prior cycle.
Phase 4: Prioritize
For each confirmed hypothesis compute:
EV = (delta_estimate × axis_weight × confidence) / (effort_tier × collision_multiplier)
effort_tier: low=1.0, medium=1.5, high=2.5collision_multiplier: 2.0 if axis shares primary files with another attack in this cycle
Select top K axes where K = min(confirmed count, 4). Document selection rationale
in cycle digest. If --axis was set, skip ranking — attack only that axis.
Phase 5: Fleet Attack
Dispatch one agent per selected axis in an isolated worktree
(Agent tool, isolation: "worktree"). Each agent receives:
- The confirmed hypothesis and its falsification criterion
- The specific files to modify
- Verification oracle:
node scripts/run-with-timeout.js 300 node scripts/test-all.js
Each agent returns a structured result:
{ "axis", "skill", "delta", "mechanism_confirmed", "files_changed", "approach" }
(schema example: docs/QUALITY_LOOPS.md#fleet-agent-result-schema).
Merge rules:
- Non-conflicting worktrees: merge all
- Conflicting worktrees (same file): keep higher delta, discard lower
- Regression on any previously passing programmatic check: abort that worktree, do not merge
mechanism_confirmed: false(score improved but not via predicted mechanism): record asincidental_improvement, mark hypothesis asneeds-revision
Commit each merged worktree with a message citing the hypothesis ID.
After committing changes to any SKILL.md (here or in Phase 7), run /reload-skills if the running Claude Code version supports it so the change is live this session; otherwise note that a fresh session is required before the updated skill takes effect.
Phase 6: Synthesize
For each result:
- Update belief model — append evidence record for (axis, skill)
- Update experiment log — mark
verified/refuted/incidental - Identify transferable patterns:
PATTERN: {axis_class} | Mechanism: {what caused improvement} | Delta: {avg} across {N} instances | Applies to: {skill list} | Confidence: high/medium/low
Write patterns to .planning/evolve/{target}/pattern-library.md.
Compile into wiki: After writing to the pattern library, call
/learn --from-evolve {target} --cycle {n}. This compiles cycle discoveries
into .planning/wiki/ — integrating with findings from prior cycles and
campaigns rather than siloing them in the evolve directory. Skip if /learn
is not available in this session (log the skip, do not block the cycle).
Phase 7: Cross-Pollinate
For each confidence: high pattern, or any pattern confirmed in 2+ skills:
apply to all other applicable skills as targeted single-file edits — without
running a full attack cycle.
Run verification oracle per cross-pollinated skill. Commit only if all
programmatic checks pass and no axis drops > 0.3. Revert on regression; mark
pattern as context-dependent.
Write patterns that apply beyond this target to .planning/research/patterns.md.
Phase 8: Loop or Halt
Compute learning velocity:
velocity = Σ(delta across all attacked axes this cycle) / axes_attacked
Append to director-state.json velocity history.
Halt conditions (check in order):
--ncycles completed--budgetreached (cumulative cost ≥ limit)- All axes ≥ 9.0 across all scored skills
velocity < 0.2for 3 consecutive cycles AND noneeds-evidencehypotheses remain- Level-up triggered (see below)
- User says stop
On velocity drop, before halting: attempt one axis-class switch — attack the highest-EV axis from a category not touched in the last 2 cycles. If velocity is still < 0.2 after that cycle, halt.
On level-up trigger (no axis improved > 0.5 for 2 loops, ≥ 3 loops run,
no programmatic failures): write level-up proposals to
.planning/rubrics/{target}-proposals.md, set status: level-up-pending in
director state, halt. The campaign resumes only after the human approves and
edits the live rubric.
On normal loop: increment cycle, compress prior cycle findings to continuation context, return to Phase 1.
Unlimited Mode
No --n and no --budget = unlimited. Declare before starting: target, exit
conditions (all axes ≥ 9.0 OR velocity < 0.2 for 3 cycles), estimated cost
($12–18/cycle), spend so far ($0), and how to halt (type /stop or press Escape
to stop after the current cycle). At the end of every cycle report cycle spend,
cumulative spend, and velocity. Literal declaration and report templates:
docs/QUALITY_LOOPS.md#unlimited-mode-templates.
When context approaches compression territory (session duration > 30 min or
/compact recommended): write continuation checkpoint to director state,
surface the --continue command. The next session picks up exactly where this
one stopped.
For overnight / unattended runs: combine with /daemon. The director is
daemon-compatible — daemon calls /evolve {target} --continue each session.
Set --budget to cap total spend.
Fringe Cases
.planning/does not exist: error — run/do setupfirst to initialize the harness state directory, then retry.- No rubric: error — run
/improve {target}Phase 0 first. List available targets in.planning/rubrics/as hint. - No prior scores in belief model: proceed from cycle 1; all deltas empty on first survey. Expected.
- All scouts return
needs-evidence: attack the top-EV axis anyway underlow-confidenceflag; record as exploratory. Mark result regardless. - Scout agent hangs or times out (dispatched scout never returns): After 10 minutes without a response, log the scout as
status: timed-outin the experiment log withconfidence: 0. Proceed with the remaining returned scouts. Never let a hung scout block the cycle — if all scouts time out, treat as "all scouts return needs-evidence" and attack the top-EV axis underlow-confidenceflag. - All axes collide (every axis shares files): serialize top 2 axes; parallelize remainder. Log collision.
- Cross-pollination causes regression: revert that skill, mark pattern
context-dependent, do not propagate further. - Level-up mid-campaign: pause, write proposals, set
level-up-pending./evolve --continueafter human approval resumes cycle numbering from where it stopped. - Budget overrun risk: if projected spend for current cycle would exceed
--budgetby > 20%, warn and confirm before dispatching fleet. --continuewith no director state: error — no campaign to resume. Suggest/evolve {target}to start fresh.- Pattern library > 50 entries: consolidate — group by axis class, merge similar patterns, keep highest-confidence instance of each class. Log consolidation.
- Zero skills match target rubric: error with message listing all
.planning/rubrics/*.mdtargets.
Quality Gates
- Every hypothesis must have an explicit falsification criterion before Phase 3
- Scouts must run before fleet dispatch on any unconfirmed hypothesis
- Belief model written after every phase, not only end of cycle
- Cross-pollination requires passing verification oracle before commit
- Regression on any previously-passing axis aborts that worktree commit
- Pattern library and global patterns updated at every cycle end, even zero-improvement cycles
- Cycle digest written even on abort or no-change cycles
Contextual Gates
Disclosure:
- State mode (unlimited / fixed / budget) and exit conditions before first cycle
- Estimate $12–18 per full cycle before starting
- Report spend and velocity at end of every cycle
- Confirm before continuing if cumulative spend exceeds $50
Reversibility: Red. Cross-pollination modifies many files across the repo;
level-up rewrites rubric anchors permanently. Each commit is individually
revertable; high volume. Range: git revert {first}^..{last}.
Trust gates:
- Novice (0-4 sessions):
--statusand--n=1only; unlimited blocked - Familiar (5-19): up to
--n=5; unlimited requires explicit--budgetcap - Trusted (20+): no cap; confirm if projected total > $100
Exit Protocol
---HANDOFF---
- Target: {target} | Cycles: {n} | Spend: ${total} | Mode: {unlimited/n/budget}
- Axes improved: {list with deltas}
- Belief model: .planning/evolve/{target}/belief-model.jsonl ({N} confirmed, {M} rejected)
- Pattern library: .planning/evolve/{target}/pattern-library.md ({N} patterns)
- Global patterns: .planning/research/patterns.md
- Knowledge wiki: .planning/wiki/index.md (compiled via /learn --from-evolve after each cycle)
- Cycle digests: .planning/evolve/{target}/cycle-*-digest.md
- Halt reason: {ceiling/velocity/budget/n-complete/user-stop/level-up-pending}
- Level-up proposals: {path or N/A}
- Reversibility: red — {N} commits across {M} files; revert range: git revert {range}
- Recommended next: {level-up and re-run / new target / done}
---