name: skill-evaluator
description: Evaluate Codex or agent skills with realistic prompts, raw model outputs, token and context cost, source verification, hallucination risk, and anti-overfitting checks. Use when creating, revising, or deciding whether to ship an installable skill, especially when testing behavior across strong and weaker models or checking whether a skill generalizes beyond its eval cases.

Skill Evaluator

Use this skill to test whether another skill changes agent behavior in the intended way without hiding answers in the test setup. Measure correctness, evidence quality, token cost, and generalization.

Workflow

1. Identify the skill under test and its intended behavior from SKILL.md.
2. Create eval artifacts outside the installable skill first, usually under /tmp/<skill-name>-eval/ or a repo-local non-shipped eval folder.
3. Write 4-8 realistic prompts:
   - Include ordinary success cases.
   - Include at least one ambiguous or under-specified case.
   - Include one adjacent anti-overfit case that uses the same behavior pattern but different nouns, APIs, files, or examples.
   - Include one prompt that should force source lookup instead of confident memory.
4. Run each prompt in a fresh context when feasible. Compare at least one target model against a baseline without the skill when the cost is reasonable.
5. Grade raw outputs before editing the skill. Do not patch the skill from a single narrow failure unless the fix is a general rule.
6. If the skill changes, rerun the failed prompt and one adjacent anti-overfit prompt.
7. Summarize pass/fail, token cost, loaded context, sources used, and remaining risks.
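
The prompt-mix requirements above can be checked mechanically before any run. The following is a minimal Python sketch, not part of any runner: `EvalCase`, the `kind` labels, and `check_prompt_set` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels for the case types the workflow asks for.
REQUIRED_KINDS = {"success", "ambiguous", "anti_overfit", "source_lookup"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    kind: str    # expected to be one of REQUIRED_KINDS
    prompt: str  # the prompt text only; no expected answer


def check_prompt_set(cases: list[EvalCase]) -> list[str]:
    """Return a list of problems; an empty list means the set looks complete."""
    problems = []
    if not 4 <= len(cases) <= 8:
        problems.append(f"expected 4-8 prompts, got {len(cases)}")
    present = {case.kind for case in cases}
    for kind in sorted(present - REQUIRED_KINDS):
        problems.append(f"unknown case kind: {kind}")
    for kind in sorted(REQUIRED_KINDS - present):
        problems.append(f"missing case kind: {kind}")
    return problems
```

An empty result only says the mix is present; whether each prompt is realistic still needs a human read.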

What To Measure

Load references/evaluation-matrix.md when you need a rubric or report format.

Always record:

- Correctness: Did the output solve the task?
- Evidence discipline: Did exact claims cite commands, local files, URLs, or observed output?
- Hallucination resistance: Did the model refuse to guess APIs, errors, paths, flags, or version-sensitive facts?
- Freshness behavior: Did the model admit when its knowledge may be stale and choose a current source for new features, ecosystem packages, APIs, command flags, or provider behavior?
- Token and context cost: Which skill files and references were loaded, and how many tokens or rough context units were consumed?
- Tradeoff: Did the skill spend extra tokens to prevent a real correctness or hallucination failure, and is that cost justified for this task class?
- Generalization: Did the behavior hold on adjacent cases, or only on the named test prompt?
- Non-regression: Did a fix make normal tasks more verbose or less useful?
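
One way to keep these measurements comparable across runs is a small per-run record. The sketch below is illustrative only: `RunRecord`, its field names, and the four-characters-per-token estimate are assumptions, and real tokenizers will give different counts. Prefer provider-reported token counts when they are available.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    model: str
    prompt_id: str
    skill_mode: str  # hypothetical values: "with_skill" or "baseline"
    loaded_files: list[str] = field(default_factory=list)
    tokens_in: int = 0
    tokens_out: int = 0
    correct: bool = False
    cited_sources: list[str] = field(default_factory=list)


def rough_tokens(text: str) -> int:
    """Crude estimate assuming about 4 characters per token (an assumption)."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def token_overhead(with_skill: RunRecord, baseline: RunRecord) -> int:
    """Extra tokens the skill run consumed compared with the baseline run."""
    skill_total = with_skill.tokens_in + with_skill.tokens_out
    base_total = baseline.tokens_in + baseline.tokens_out
    return skill_total - base_total
```

The overhead number feeds the tradeoff question: extra tokens are justified only if the skill run fixed a correctness or hallucination failure that the baseline showed.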

Runner Selection

Load references/runners.md before invoking CLI runners, weak/stale models, or networked model providers.

Prefer:

- One strong model to check that the skill can succeed.
- One weaker, older, or more hallucination-prone model to expose missing guardrails.
- Fresh sessions with minimal prompt-local context.

Do not store credentials in eval artifacts. Do not include expected answers in the prompt unless the eval is explicitly grading transformation of known content.
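
A quick scan of eval artifacts can catch obvious credential leaks before they are saved or shared. This is a rough Python sketch: the patterns are illustrative guesses, not a complete secret detector, and `find_possible_secrets` is a hypothetical helper name. Expect both false positives and misses.

```python
import re

# Illustrative patterns only; extend them for the providers actually in use.
SECRET_PATTERNS = [
    re.compile(r"\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*\S+", re.IGNORECASE),
    re.compile(r"\bauthorization:\s*bearer\s+\S+", re.IGNORECASE),
]


def find_possible_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for lines that look like credentials."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

Run it over prompts, raw outputs, and runner logs; treat any hit as a reason to inspect the file by hand.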

Anti-Overfit Rules

- Do not add direct links or facts only because a single eval mentioned them.
- Prefer reusable behavior rules: when to verify, what source type to use, how to structure the answer, and what not to claim.
- Add domain facts only when they are broadly useful, stable, and likely to reduce repeated failures.
- After each skill patch, add or run an adjacent case that cannot be solved by matching the previous test wording.
- Keep eval cases outside installable skill folders until they are clearly worth shipping as examples or scripts.
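
To sanity-check that an adjacent case does not simply reuse the previous test wording, a word-overlap score can serve as a first filter. The sketch below uses Jaccard similarity over lowercase word sets; `wording_overlap` is a hypothetical helper, and any cutoff applied to its result is a judgment call rather than an established threshold. Low overlap does not prove the case is independent, it only flags near-copies.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def wording_overlap(prompt_a: str, prompt_b: str) -> float:
    """Jaccard similarity of the two prompts' word sets, from 0.0 to 1.0."""
    words_a, words_b = word_set(prompt_a), word_set(prompt_b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

A high score between the failed prompt and its supposed adjacent case suggests the new case shares too many nouns, APIs, or file names to test generalization.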

Report Shape

Keep the final report short and decision-oriented:

## Result
Pass / partial / fail.

## Runs
- model, prompt id, skill mode, token/cost data, raw output path

## Findings
- correctness issues
- evidence or hallucination issues
- token/cost issues
- overfit or generalization issues

## Changes
- skill edits made, if any
- retest result
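
The report shape above can be generated from collected results so every eval produces the same layout. A minimal sketch, assuming the caller supplies plain strings; `render_report` is a hypothetical helper, not part of any runner.

```python
def render_report(result: str, runs: list[str], findings: list[str], changes: list[str]) -> str:
    """Render the short decision-oriented report as markdown text."""
    sections = [("Runs", runs), ("Findings", findings), ("Changes", changes)]
    lines = ["## Result", result]
    for title, items in sections:
        lines += ["", f"## {title}"]
        # An empty section is written as "- none" so it is visibly empty.
        lines += [f"- {item}" for item in items] or ["- none"]
    return "\n".join(lines) + "\n"
```

Writing "- none" for an empty section makes it clear the section was considered, not forgotten.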