evo-exprmn-optmzr - SKILL.md Agent Skill

name: evo-exprmn-optmzr
description: Convert evo's experiment-tree optimization model into a Hermes-native workflow for benchmark-driven code evolution using git worktrees, scored experiments, gates, traces, and iterative branching.
version: 1

Goal

- Let Hermes run evo-style optimization without requiring Claude Code plugin surfaces.
- Use repo-local git worktrees as experiment branches.
- Keep only changes that improve a measurable benchmark and pass gates.
- Preserve traces, diffs, and branch history for later inspection.

Use when

- User wants benchmark-driven autonomous improvement.
- A repo has a clear target file/module and a benchmark command that emits a parsable score.
- The task benefits from branching search instead of one linear edit loop.
- You want Hermes to explore multiple hypotheses with rollback discipline.

Do not use when

Core model absorbed from evo

Hermes-native workflow

Preconditions
- Verify repo is git-controlled and clean enough to branch.
- Verify benchmark command runs from repo root.
- Verify score extraction rule before any code edits.
- Verify any required gate commands.
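The repository check above could be scripted roughly as follows. This is a minimal sketch, assuming `git` is on the PATH; `is_clean` and `check_repo` are hypothetical helper names, not part of evo or Hermes. It treats any pending change (including untracked files) as "not clean", which is stricter than "clean enough" and may need loosening.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """True when `git status --porcelain` printed nothing (no pending changes)."""
    return porcelain_output.strip() == ""


def check_repo(repo_root: str = ".") -> bool:
    """Return True if repo_root is inside a git work tree with no pending changes."""
    inside = subprocess.run(
        ["git", "rev-parse", "--is-inside-work-tree"],
        cwd=repo_root, capture_output=True, text=True,
    )
    if inside.returncode != 0 or inside.stdout.strip() != "true":
        return False
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    )
    return is_clean(status.stdout)
```

Calling `check_repo(".")` from the repo root before any edits gives a quick yes/no answer; the benchmark and gate commands still need their own dry run.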
Define experiment contract in chat or repo note
- target path(s)
- benchmark command
- metric direction: max or min
- gate command(s)
- score parse rule
- budget: number of branches / rounds
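One way to pin the contract down is a small record type that refuses ambiguous values. The sketch below is illustrative only: the class name `ExperimentContract` and its field names are assumptions chosen to mirror the bullets above, not a format defined by evo or Hermes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentContract:
    """Hypothetical container for the experiment contract fields."""
    target_paths: list[str]   # file(s) or module(s) the experiments may edit
    benchmark_cmd: str        # command that emits the score
    direction: str            # "max" or "min"; never inferred
    gate_cmds: list[str]      # commands that must all succeed
    score_pattern: str        # parse rule, e.g. a regex with one capture group
    max_branches: int         # budget: branches per round
    max_rounds: int           # budget: number of rounds

    def __post_init__(self) -> None:
        # Reject anything other than an explicit direction.
        if self.direction not in ("max", "min"):
            raise ValueError("direction must be 'max' or 'min'")
        if not self.target_paths:
            raise ValueError("at least one target path is required")
        if self.max_branches < 1 or self.max_rounds < 1:
            raise ValueError("budget values must be positive")
```

Whether this lives in chat or in a repo note, the point is the same: every field is filled in before the first branch is created.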
Create baseline
- Run benchmark on current branch.
- Save score and command output.
- Treat this as root parent.
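The baseline step could look something like the sketch below. It only captures and stores the raw command output as the root record; extracting the score from that output is left to the agreed parse rule. The function name `run_baseline`, the record keys, and the `baseline.json` file name are all assumptions for illustration.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_baseline(benchmark_cmd: list[str], out_dir: Path, cwd: str = ".") -> dict:
    """Run the benchmark once on the current branch and save its raw output."""
    result = subprocess.run(benchmark_cmd, cwd=cwd, capture_output=True, text=True)
    record = {
        "id": "root",          # root of the experiment tree
        "parent": None,        # the baseline has no parent
        "command": benchmark_cmd,
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "baseline.json").write_text(json.dumps(record, indent=2))
    return record


if __name__ == "__main__":
    # Stand-in benchmark that just prints a score line.
    demo_cmd = [sys.executable, "-c", "print('score: 0.5')"]
    print(run_baseline(demo_cmd, Path("traces"))["stdout"])
```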
Branch experiments with worktrees
- For each hypothesis, create a git worktree on a branch like evo/<topic>/<n>.
- Make one focused change per branch.
- Run benchmark.
- Run gates.
- Save benchmark output, gate output, diff, and short hypothesis note.
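Creating one worktree per hypothesis can be sketched as below, using `git worktree add -b <branch> <path> <commit-ish>`. The helper names and the `.evo/worktrees` directory are assumptions; a worktree directory inside the repo would normally need to be ignored so it does not show up as untracked content.

```python
import subprocess


def branch_name(topic: str, n: int) -> str:
    """Branch naming scheme from the workflow: evo/<topic>/<n>."""
    return f"evo/{topic}/{n}"


def worktree_add_cmd(topic: str, n: int, parent: str,
                     worktrees_dir: str = ".evo/worktrees") -> list[str]:
    """Build the git command that creates a new branch and worktree from `parent`."""
    path = f"{worktrees_dir}/{topic}-{n}"   # hypothetical on-disk layout
    return ["git", "worktree", "add", "-b", branch_name(topic, n), path, parent]


def create_worktree(topic: str, n: int, parent: str, repo_root: str = ".") -> None:
    """Run the command; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_add_cmd(topic, n, parent), cwd=repo_root, check=True)
```

For example, `create_worktree("cache", 1, "main")` would branch `evo/cache/1` off `main`; the focused change, benchmark, and gates then run inside that worktree's directory.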
Keep or discard
- If score improves and gates pass: keep branch as candidate parent.
- Else: discard branch or leave as explicit failed trace if user wants auditability.
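The keep-or-discard rule reduces to a small predicate. The sketch below assumes scores are plain floats and that "improves" means strictly better than the parent; both function names are hypothetical.

```python
def is_improvement(candidate: float, parent: float, direction: str) -> bool:
    """Strict improvement over the parent score in the declared direction."""
    if direction == "max":
        return candidate > parent
    if direction == "min":
        return candidate < parent
    raise ValueError("direction must be 'max' or 'min'")


def decide(candidate: float, parent: float, direction: str, gates_passed: bool) -> str:
    """Return 'keep' only when the score improved AND every gate passed."""
    if gates_passed and is_improvement(candidate, parent, direction):
        return "keep"
    return "discard"
```

A tie counts as "discard" here; for noisy benchmarks a minimum-delta threshold might be a better choice, but that is outside what the workflow above specifies.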
Iterate
- Spawn more experiments from the best surviving parent.
- Use traces from failed runs to propose new hypotheses.
- Stop on budget exhaustion, stall, or user interrupt.
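Parent selection and the stop conditions might be sketched as follows. The `Experiment` record is hypothetical, and "stall" is interpreted here as a number of consecutive rounds without gain (`patience`), which is an assumption rather than something the workflow defines. User interrupts are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experiment:
    branch: str
    score: float
    gates_passed: bool


def best_parent(experiments: list[Experiment], direction: str) -> Optional[Experiment]:
    """Pick the best-scoring experiment among those that passed their gates."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    survivors = [e for e in experiments if e.gates_passed]
    if not survivors:
        return None
    pick = max if direction == "max" else min
    return pick(survivors, key=lambda e: e.score)


def should_stop(rounds_done: int, max_rounds: int,
                rounds_without_gain: int, patience: int) -> bool:
    """Stop when the round budget is spent or the search has stalled."""
    return rounds_done >= max_rounds or rounds_without_gain >= patience
```

Failed experiments are excluded from parent selection but remain useful as traces when drafting the next round of hypotheses.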
Finalize
- Report best branch, best score delta, and gate status.
- Present exact diff or commit for human decision.
- Clean stale worktrees if user scope permits.

Suggested repo artifact layout

Recommended commands

Score parsing

Prefer benchmark commands that print one numeric score plainly.
If output is noisy, extract with a stable regex or small Python parser via execute_code.
Never guess score direction; make max/min explicit.
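A small parser for noisy output could look like the sketch below. It assumes the benchmark prints exactly one line of the form `score: <number>`; that line format, the `SCORE_RE` pattern, and the `parse_score` name are illustrative assumptions to be replaced by the agreed parse rule.

```python
import re

# Matches a whole line such as "score: 0.8125" or "score: -3e-2".
SCORE_RE = re.compile(
    r"^score:[ \t]*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_score(output: str) -> float:
    """Extract the single score line; fail loudly instead of guessing."""
    matches = SCORE_RE.findall(output)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one score line, found {len(matches)}")
    return float(matches[0])
```

Requiring exactly one match means a benchmark that prints zero or several score lines is treated as an error rather than silently picking one.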

Gates

Heuristics

- Change one variable per branch when possible.
- Keep hypotheses short and falsifiable.
- Prefer 2-5 high-signal branches over many low-quality ones.
- If the benchmark is slow, reduce branch fanout and increase hypothesis quality.
- Use delegate_task only when the user has already approved a multi-branch coding run and the subtasks are independent.

Pitfalls

Output pattern