name: eval-creator description: "[Beta] Creates permanent eval cases from promoted learnings and runs regression checks against them. Turns failures into test cases that prevent silent regression. This is the outer loop's regress-test step. Use when a learning is promoted and has a clear pass/fail condition, or on cadence to verify promoted rules still hold."
Eval Creator
Turns promoted learnings into permanent eval cases. Runs regression checks to verify promoted rules hold. This is the outer loop's regress-test step.
The blog says: "If a failure taught you something important, it should become a permanent test case. Otherwise the knowledge is still fragile."
When to Use
- After harness-updater promotes a pattern — create an eval for it
- On cadence — run all evals to check for regression
- Before major releases — verify the harness is holding
- When a promoted rule seems to have stopped working — diagnose with targeted eval run
Eval Directory Structure
.evals/
EVAL_INDEX.md # Index of all eval cases with status
cases/
eval-YYYYMMDD-001.md # Individual eval case
eval-YYYYMMDD-002.md
...
Creating an Eval Case
Input
From harness-updater or manually:
- Pattern-Key of the promoted learning
- The rule that was added to the project instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md)
- What to test (the assertion)
- Verification method
Eval Case Format
---
id: eval-YYYYMMDD-NNN
pattern-key: [from learning]
source: [LRN-YYYYMMDD-001, ERR-YYYYMMDD-003]
promoted-rule: "[the rule text in project instruction files]"
promoted-to: CLAUDE.md # or AGENTS.md, .github/copilot-instructions.md, or equivalent
created: YYYY-MM-DD
last-run: YYYY-MM-DD
last-result: pass | fail | skip
---
## What This Tests
[One sentence: what failure this eval prevents from recurring]
## Precondition
[What must be true for this eval to be runnable]
- File X exists
- Project uses framework Y
- etc.
## Verification Method
[One of: grep-check, command-check, file-check, rule-check]
### grep-check
Search for a pattern that should (or should not) exist:
target: src/**/*.ts pattern: "hardcoded-secret-pattern" expect: not_found
### command-check
Run a command and check the exit code or output:
command: npm run typecheck expect_exit: 0
### file-check
Verify a file or section exists:
target: CLAUDE.md # or AGENTS.md, .github/copilot-instructions.md section: "## Verification" expect: exists
### rule-check
Verify a rule exists in an instruction file:
target: CLAUDE.md # or AGENTS.md, .github/copilot-instructions.md contains: "[the promoted rule text or key phrase]" expect: found
## Expected Result
**Pass:** [What "good" looks like]
**Fail:** [What regression looks like]
## Recovery Action
If this eval fails:
1. [Specific step to diagnose]
2. [Specific step to fix]
3. Re-run this eval to verify
Running Evals
Run All
Read .evals/EVAL_INDEX.md, iterate through all cases, execute each verification method.
Run by Pattern-Key
Filter to evals matching a specific pattern.
Run by Area
Filter to evals whose source files match an area (frontend, backend, etc.).
Execution
For each eval case:
- Check precondition — if not met, mark as
skip - Execute verification method:
grep-check: Use Grep tool to search target files for the patterncommand-check: Run the command via Bash, check exit code and/or outputfile-check: Use Read/Glob to verify file/section existencerule-check: Read the target file, search for the expected contentskill-check: Runquick_validate.pyon a skill directory (see Skill Validation below)script-check: Run a custom mcp-script by name (see Custom Verification Methods)
- Compare result to expected
- Update
last-runandlast-resultin the eval case file - Update
EVAL_INDEX.mdwith the result
Regression Report
## Eval Run: YYYY-MM-DD
**Total:** N evals
**Passed:** N
**Failed:** N
**Skipped:** N
### Failures
#### eval-YYYYMMDD-001 — [pattern-key]
- **What regressed:** [description]
- **Expected:** [X]
- **Got:** [Y]
- **Recovery action:** [from eval case]
### Summary
[All green / N regressions need attention]
Eval Index Format
.evals/EVAL_INDEX.md:
# Eval Index
| ID | Pattern-Key | Rule Summary | Last Run | Result | Created |
|----|-------------|-------------|----------|--------|---------|
| eval-YYYYMMDD-001 | auth-middleware-lock | Run migrations on test DB first | YYYY-MM-DD | pass | YYYY-MM-DD |
| eval-YYYYMMDD-002 | pnpm-not-npm | Use pnpm in this repo | YYYY-MM-DD | fail | YYYY-MM-DD |
Integration
Upstream
- harness-updater flags eval candidates after promoting a pattern (harness-updater is a plugin-bundle agent,
plugin/agents/harness-updater.md; without the plugin, the human applying promotions manually flags eval candidates from the gap report instead) - learning-aggregator identifies patterns with clear pass/fail conditions
Downstream
- Regression failures feed back into self-improvement as new error entries
- Persistent failures may indicate the promoted rule needs refinement → feed back to harness-updater
Scheduled Use
For projects with a CI pipeline, eval-creator can run as a scheduled check:
- Weekly: run all evals
- Per-PR: run evals related to changed files
- Post-promotion: run the newly created eval immediately
Custom Verification Methods (mcp-scripts)
Beyond the four built-in methods (grep-check, command-check, file-check, rule-check), projects can define custom verification tools as mcp-scripts for complex assertions that the built-ins can't express.
Example — an eval that verifies a promoted auth rule is enforced:
# In gh-aw workflow config
mcp-scripts:
check-auth-middleware:
lang: javascript
description: "Verify all /admin routes have auth middleware"
run: |
const routes = require('./src/routes/admin');
const unprotected = routes.filter(r => !r.auth);
if (unprotected.length) {
console.error('Unprotected admin routes:', unprotected.map(r => r.path));
process.exit(1);
}
Reference the script in an eval case as verification_method: script-check with the mcp-script name. This is an extension point — the built-in methods cover most cases, but mcp-scripts handle project-specific behavioral assertions.
Persistence
Eval cases live in .evals/ in the working directory. The skill does not integrate with external memory backends in interactive sessions. For CI-side durable storage, see eval-creator-ci, which can optionally back its run history with gh-aw's repo-memory.
Skill Validation (skill-check)
The Anthropic /skill-creator skill includes two validation systems that eval-creator can use:
Structural validation via quick_validate.py
The skill-check verification method runs the skill-creator's quick_validate.py script on a skill directory. It checks:
- SKILL.md exists with valid YAML frontmatter
- Only allowed frontmatter keys (
name,description,license,allowed-tools,metadata,compatibility) - Name is kebab-case, max 64 chars, no leading/trailing/consecutive hyphens
- Description has no angle brackets, max 1024 chars
- Compatibility field max 500 chars if present
Eval case example:
---
id: eval-YYYYMMDD-NNN
pattern-key: skill-quality.verify-gate
verification_method: skill-check
target: skills/verify-gate
expect: valid
---
## What This Tests
Verify that the verify-gate skill passes structural validation after harness updates.
Execution: python .claude/skills/skill-creator/scripts/quick_validate.py <target>. Exit 0 = pass, exit 1 = fail.
Behavioral validation via run_eval.py
For deeper validation, the skill-creator's run_eval.py tests whether a skill's description causes Claude to invoke it for given queries. This is useful when harness-updater modifies a skill's description or the outer loop creates a new skill — the eval verifies the skill still triggers correctly.
This requires Claude CLI access and is expensive. Use it for high-value skills only, not as a routine CI check.
When to create skill-check evals
Two scenarios connect the outer loop to skill validation:
Harness-updater modifies a skill: When a promoted rule is inserted into a SKILL.md (rather than a project instruction file), create a
skill-checkeval to verify the skill remains structurally valid after the edit.Self-improvement identifies a skill gap: When learning-aggregator classifies a pattern as
skill_gapand recommends "create a new skill", the new skill should passquick_validate.pybefore being committed. Create askill-checkeval for it that persists as a regression test.
This closes the loop: failure → learning → new/updated skill → eval verifies skill quality → regression prevents quality drift.
What This Skill Does NOT Do
- Does not fix regressions (reports them for the agent or human to fix)
- Does not promote learnings (that's harness-updater)
- Does not analyze patterns (that's learning-aggregator)
- Does not replace project test suites — evals test the harness, not the code