eval-runner - SKILL.md Agent Skill

name: eval-runner
description: "Run eval scenarios to benchmark Mycelium effectiveness. Execute tasks using reflexion loop, validate against success criteria, record metrics."
metadata:
  instruction_budget: "41"
  framework_dependency: "mycelium"
  framework_dependency_note: "This skill is designed to run within the Mycelium framework (https://github.com/haabe/mycelium). Standalone use will skip the canvas state, theory gates, and harness behavior the skill assumes. Install: /plugin install mycelium@haabe/mycelium."

Eval Runner

Benchmark the agent's performance against defined scenarios. Adapted from the n-trax eval system.

Commands

`run <category/name>`

1. Read YAML from .claude/evals/scenarios/<category>/<name>.yml
2. Parse fields (name, category, task_prompt, success_criteria, budget)
3. Execute setup steps if defined
4. Record start time
5. Execute task via reflexion workflow (read corrections first)
6. Record end time and iteration count
7. Validate ALL success criteria
8. Write result JSON to .claude/evals/results/<timestamp>-<name>.json
9. Report summary
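Steps 6 to 8 can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the result field names (`passed`, `criteria`, `iterations`, `duration_seconds`), the timestamp format, and the rule that an empty criteria set fails are all assumptions, since the real result format is defined in .claude/evals/schema.md.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

RESULTS_DIR = Path(".claude/evals/results")


def all_criteria_pass(checks: dict) -> bool:
    """Step 7: a run passes only if every success criterion passed.

    Assumption: a scenario with no criteria counts as a failure.
    """
    return bool(checks) and all(checks.values())


def write_result(name, checks, started, ended, iterations, results_dir=RESULTS_DIR):
    """Step 8: write <timestamp>-<name>.json and return its path."""
    stamp = ended.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    result = {
        # Hypothetical field names; defer to schema.md for the real ones.
        "name": name,
        "passed": all_criteria_pass(checks),
        "criteria": checks,
        "iterations": iterations,
        "duration_seconds": (ended - started).total_seconds(),
    }
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{stamp}-{name}.json"
    path.write_text(json.dumps(result, indent=2))
    return path
```

Passing `started` and `ended` as timezone-aware UTC datetimes (for example `datetime.now(timezone.utc)`) keeps the `Z` suffix in the file name accurate.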

`run-all [category]`

1. Glob .claude/evals/scenarios/**/*.yml (restrict to .claude/evals/scenarios/<category>/ when a category is given)
2. Skip scenarios with status: retired
3. For each: run in isolation (git stash), record result, restore
4. Update .claude/evals/pass-history.json with each result
5. Aggregate and report

`run-split <optimization|holdout>`

1. Glob .claude/evals/scenarios/**/*.yml
2. Read each YAML, filter by split field matching the requested set
3. Skip scenarios with status: retired
4. For each matching scenario: run in isolation, record result, restore
5. Update .claude/evals/pass-history.json with each result
6. Aggregate and report (label output clearly as "Optimization Set" or "Holdout Set")
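The filtering shared by `run-all` and `run-split` might look like the sketch below. It assumes each scenario YAML has already been parsed into a dict; `select_scenarios` and `SPLIT_LABELS` are hypothetical names introduced for illustration.

```python
# Report labels for the two splits named in the run-split command.
SPLIT_LABELS = {"optimization": "Optimization Set", "holdout": "Holdout Set"}


def select_scenarios(scenarios, split=None):
    """Drop retired scenarios; optionally keep only one split.

    `scenarios` is a list of dicts parsed from the scenario YAML files.
    With split=None this matches run-all; otherwise it matches run-split.
    """
    if split is not None and split not in SPLIT_LABELS:
        raise ValueError(f"unknown split: {split!r}")
    return [
        s
        for s in scenarios
        if s.get("status") != "retired"
        and (split is None or s.get("split") == split)
    ]
```

For example, `select_scenarios(parsed, split="holdout")` returns the scenarios to run, and `SPLIT_LABELS["holdout"]` gives the heading for the report.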

`report`

Read all results from .claude/evals/results/
Generate summary table:

| Category    | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery   | ...       | ...            | ...      |       |
| delivery    | ...       | ...            | ...      |       |
| integration | ...       | ...            | ...      |       |
| **Overall** | ...       | ...            | ...      |       |

List failure patterns and recommendations
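One way to compute the rows of the summary table is sketched below. It assumes each result dict carries `category`, `passed`, `iterations`, and `duration_seconds`; apart from `category`, these field names are hypothetical and should be replaced with those defined in .claude/evals/schema.md.

```python
from collections import defaultdict


def _row(label, results):
    """Aggregate one group of results into a table row."""
    n = len(results)
    return {
        "category": label,
        "pass_rate": sum(1 for r in results if r["passed"]) / n,
        "avg_iterations": sum(r["iterations"] for r in results) / n,
        "avg_time": sum(r["duration_seconds"] for r in results) / n,
    }


def summarize(results):
    """Return one row per category (sorted by name), then an Overall row."""
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r)
    rows = [_row(cat, by_category[cat]) for cat in sorted(by_category)]
    if results:
        rows.append(_row("**Overall**", results))
    return rows
```

The Overall row is computed over all results rather than by averaging the per-category rows, so categories with more runs weigh more.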

`prune`

1. Read .claude/evals/pass-history.json
2. Flag evals where last_5 is all-pass (saturated) or all-fail (broken)
3. Flag evals with no runs in 30+ days (stale)
4. For saturated evals, suggest: retire or increase difficulty
5. For broken evals, suggest: fix criteria or retire
6. Present recommendations; do NOT auto-retire
7. On user confirmation: set status: retired in scenario YAML, update pass-history.json, log in .claude/harness/decision-log.md
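The flagging rules can be sketched as a pure function over one pass-history entry. Several details here are assumptions rather than documented behavior: `last_run` is taken to be an ISO 8601 timestamp with a UTC offset, an eval needs a full window of five runs before it counts as saturated or broken, and an eval with no recorded run is treated as stale.

```python
from datetime import datetime, timedelta, timezone


def classify(entry, now, stale_after_days=30):
    """Return the prune flags for one pass-history entry.

    Only flags are produced; retiring stays a human decision.
    """
    flags = []
    last_5 = entry.get("last_5", [])
    if len(last_5) == 5 and all(last_5):
        flags.append("saturated")  # suggest: retire or increase difficulty
    elif len(last_5) == 5 and not any(last_5):
        flags.append("broken")  # suggest: fix criteria or retire
    last_run = entry.get("last_run")
    if last_run is None:
        flags.append("stale")  # assumption: never run counts as stale
    elif now - datetime.fromisoformat(last_run) >= timedelta(days=stale_after_days):
        flags.append("stale")
    return flags
```

Call it with `now=datetime.now(timezone.utc)` so the subtraction compares two timezone-aware values.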

`mine`

Analyze audit logs to propose new eval scenarios from observed failure patterns.

1. Read .claude/state/change-log.jsonl (last 100 entries)
2. Read .claude/state/diamond-state-audit.jsonl (all entries)
3. Group change-log entries by session_id and identify:
   a. Correction clusters: 3+ edits to same file in one session (agent struggled)
   b. Skill friction: edits to .claude/skills/*/SKILL.md during a session (instructions unclear)
   c. Missing test coverage: 5+ files changed with no test file edits
4. Count diamond bypass entries from audit log
5. Cross-reference with existing scenarios in pass-history.json to avoid duplicates
6. For each pattern, output a proposed eval scenario as YAML template
7. Tag proposals with source: trace-mining and originating session_id
8. Do NOT auto-create files; present proposals for human review

See .claude/evals/schema.md §Trace Mining Heuristics for pattern-to-eval mappings.
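Two of the heuristics above (correction clusters and missing test coverage) can be sketched over already-parsed change-log entries. The sketch assumes each entry is a dict with `session_id` and a `file` path; the `file` key and the `is_test_file` rule are hypothetical, and the authoritative mappings remain those in schema.md.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def is_test_file(path):
    """Hypothetical rule: a file counts as a test if its name contains 'test'."""
    return "test" in PurePosixPath(path).name.lower()


def correction_clusters(entries, min_edits=3):
    """(session_id, file) pairs edited min_edits or more times in one session."""
    counts = Counter((e["session_id"], e["file"]) for e in entries)
    return sorted(key for key, n in counts.items() if n >= min_edits)


def sessions_missing_tests(entries, min_files=5):
    """Sessions that changed min_files or more distinct files, none of them tests."""
    files = defaultdict(set)
    for e in entries:
        files[e["session_id"]].add(e["file"])
    return sorted(
        sid
        for sid, changed in files.items()
        if len(changed) >= min_files and not any(is_test_file(f) for f in changed)
    )
```

Each returned session_id can then be attached to the proposed scenario alongside source: trace-mining.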

Result Format

See .claude/evals/schema.md for YAML scenario and JSON result formats.

Pass History

After writing each result JSON (step 8 of run), also update .claude/evals/pass-history.json:

Increment runs and passes (if passed) for the eval
Append true/false to last_5 (trim to keep only last 5)
Update last_run timestamp
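The update can be sketched as below. The field names `runs`, `passes`, `last_5`, and `last_run` come from the steps above; keying the history by eval name at the top level is an assumption about the file's layout.

```python
def update_pass_history(history, name, passed, timestamp):
    """Apply one run's outcome to the in-memory pass history.

    `history` is the dict loaded from pass-history.json (assumed to be
    keyed by eval name); the caller is responsible for writing it back.
    """
    entry = history.setdefault(
        name, {"runs": 0, "passes": 0, "last_5": [], "last_run": None}
    )
    entry["runs"] += 1
    if passed:
        entry["passes"] += 1
    # Append the newest outcome, then keep only the five most recent.
    entry["last_5"] = (entry["last_5"] + [bool(passed)])[-5:]
    entry["last_run"] = timestamp
    return entry
```

Slicing with `[-5:]` after appending keeps the window at five entries without a separate length check.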

Creating Scenarios

Place YAML files in .claude/evals/scenarios/<category>/. Define task_prompt, success_criteria, and budget. Set the split, status, and source fields as described in .claude/evals/schema.md.
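A quick completeness check for a new scenario, once its YAML has been parsed into a dict, might look like this. It only checks that the fields named above are present; `missing_fields` is a hypothetical helper, and allowed values and types are left to schema.md.

```python
# Fields this section says every scenario should define or set.
EXPECTED_FIELDS = (
    "task_prompt",
    "success_criteria",
    "budget",
    "split",
    "status",
    "source",
)


def missing_fields(scenario):
    """Return the expected field names absent from a parsed scenario dict."""
    return [field for field in EXPECTED_FIELDS if field not in scenario]
```

An empty list means all six fields are present; it does not validate their contents.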