eval-dataset

name: eval-dataset
description: Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."
user-invocable: true
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion

You generate evaluation test cases for a skill. You read the skill analysis (eval.md) and eval config (eval.yaml) to understand what the skill does, then create realistic test cases that match the dataset schema. The goal is to give /eval-run something meaningful to test against.

Step 0: Parse Arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | no | auto-discover | Path to eval config |
| `--count <N>` | no | 5 | Number of cases to generate |
| `--strategy <type>` | no | `bootstrap` | Generation strategy (see Step 4) |
| `--run-id <id>` | no | — | Previous eval run to learn from (used with `expand`) |
| `--harbor` | no | — | Also generate Harbor task packages (Step 8) |
| `--image <image>` | with `--harbor` | — | Container image for Harbor task packages |

Config Discovery

If --config was explicitly provided, use that path directly. Otherwise, auto-discover:

python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py

1 config found: auto-select it as <config>
Multiple configs found: present the list and ask the user which eval's dataset to populate
No configs found: suggest running /eval-analyze first
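
The three discovery outcomes can be read as a small decision function. The sketch below is illustrative only: `select_config` and its return shape are hypothetical, and it says nothing about how discover.py is actually implemented.

```python
from typing import List, Tuple, Union


def select_config(found: List[str]) -> Tuple[str, Union[str, List[str]]]:
    """Map discovered config paths to the next action (hypothetical helper)."""
    if not found:
        # Nothing to populate yet: point the user at /eval-analyze.
        return ("suggest", "/eval-analyze")
    if len(found) == 1:
        # Exactly one config: auto-select it as <config>.
        return ("use", found[0])
    # Several configs: present the list and let the user choose.
    return ("ask", sorted(found))
```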

Step 1: Read Context

Read eval.yaml and eval.md to understand:

The skill — what it does, what inputs it expects, what it produces
The execution config — execution.mode (case or batch) and execution.arguments (the argument template). In case mode, {field} placeholders in the arguments are resolved per case from input.yaml — every field referenced in the template (e.g., {strat_key}, {prompt}) must exist in the generated input.yaml files.
The dataset schema — dataset.schema describes the case structure (files, fields, formats)
The dataset path — where cases should be created
The output schema — outputs[*].schema describes what the skill produces (informs what reference outputs look like)
The judges — extract the evaluation criteria from each judge. The approach depends on the judge type:
- builtin judges have predefined criteria — list them with python3 ${CLAUDE_SKILL_DIR}/../eval-analyze/scripts/list_builtins.py. Use the builtin's known behavior to inform case design (e.g., cost_budget → include a large-input case that tests cost scaling; output_completeness → include a case with many requirements)
- check snippets reveal exact validation logic — what fields are accessed, what thresholds are used, what conditions trigger pass/fail
- prompt / prompt_file text describes quality dimensions (completeness, accuracy, etc.)
- description summarizes what each judge evaluates

Build a list of judge-driven requirements — these are the concrete things judges will check. Each test case should be designed to exercise at least one of these requirements. For example:

A judge checking len(content) >= 100 → include a case with minimal input that might produce short output
A judge comparing against a reference → include a case where the correct answer is unambiguous
A judge checking tool calls → include a case where the skill should (or shouldn't) invoke external tools
A cost/efficiency judge → include a case with large input that tests scaling

If eval.yaml doesn't exist, ask the user which skill to evaluate, then invoke /eval-analyze to create the config:

Use the Skill tool to invoke /eval-analyze --skill <skill-name>

Wait for the analysis to complete, then re-read eval.yaml. If /eval-analyze fails or the user skips it, you cannot generate meaningful cases — stop and explain why.

If eval.md doesn't exist, you can still work from eval.yaml's schema descriptions, but the cases will be less targeted.

Assess recommended case count

After reading the skill analysis and judges, estimate whether --count is sufficient. Count the skill's distinct execution paths (branches, modes, optional steps), the number of judges, and the number of conditional judges. A rough guideline: you need at least one case per execution path, plus enough variety for each judge to have both passing and failing examples. If the skill has 4 execution paths and 6 judges, 5 cases may be thin — suggest a higher count to the user ("This skill has N distinct paths and M judges — consider --count 12 for better coverage").
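
One way to turn that guideline into a number is sketched below. The formula (one case per execution path, plus one case per judge and one more per conditional judge) is an assumption made for illustration, not a rule defined by this skill, so treat the result as a starting point for the suggestion rather than a requirement.

```python
from typing import Optional


def recommended_case_count(paths: int, judges: int, conditional_judges: int = 0) -> int:
    """Rough lower bound on dataset size (assumed heuristic, not a fixed rule)."""
    return paths + judges + conditional_judges


def count_advice(requested: int, paths: int, judges: int,
                 conditional_judges: int = 0) -> Optional[str]:
    """Return a suggestion for the user, or None if --count already looks sufficient."""
    needed = recommended_case_count(paths, judges, conditional_judges)
    if requested >= needed:
        return None
    return (f"This skill has {paths} distinct paths and {judges} judges; "
            f"consider --count {needed} for better coverage")
```

For example, `count_advice(5, paths=4, judges=6, conditional_judges=2)` produces a suggestion mentioning `--count 12`, while a request for 12 or more cases returns `None`.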

Step 2: Parse Schema into Generation Template

Read dataset.schema and extract a concrete checklist:

1. Required files — what files each case directory must contain (e.g., input.yaml, reference.md)
2. Required fields per file — for structured files like YAML/JSON, which fields are mandatory
3. Optional fields — fields described with "optionally" or "if available" — vary these across cases (include in some, omit in others) to test the skill's handling of missing optional context
4. Field semantics — what kind of content each field expects (e.g., "problem statement", "clarifying context", "priority level"). Use these descriptions to generate realistic content, not generic placeholders
5. Naming patterns — any file naming conventions mentioned (e.g., "named NNN-slug.md")
6. Argument fields — if execution.mode is case, parse execution.arguments for {field} placeholders. Every placeholder must appear as a required field in input.yaml. Cross-check against items 1-2 above — if {strat_key} is in the arguments but not in the schema, add it as a required field.
7. External-state fields — look for fields marked with [EXTERNAL: System] in the schema description. These reference real resources in external systems (Jira projects, GitHub repos, API endpoints) that must exist at execution time. Do NOT invent values for these fields — fabricated values (e.g., a Jira project key derived from the repo directory name) cause silent failures when the skill queries the external system and gets zero results. Mark these in your generation template as requiring TODO_ placeholder values (see Step 5).

This checklist is your generation template. Every case must satisfy items 1-2 and 6. Items 3-4 guide content variety.
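
The argument-field cross-check can be sketched as a small helper. This assumes placeholders are plain `{name}` tokens with identifier-style names; the real template syntax may differ (for example, escaped braces), and the function names and the sample template below are hypothetical.

```python
import re
from typing import Iterable, List

# Assumed placeholder syntax: {identifier}
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def argument_fields(arguments: str) -> List[str]:
    """Return the unique {field} names in the template, in order of appearance."""
    seen: List[str] = []
    for name in PLACEHOLDER.findall(arguments):
        if name not in seen:
            seen.append(name)
    return seen


def missing_fields(arguments: str, schema_fields: Iterable[str]) -> List[str]:
    """Fields the template needs that the schema's required fields do not list."""
    known = set(schema_fields)
    return [name for name in argument_fields(arguments) if name not in known]
```

With a made-up template such as `"--key {strat_key} --prompt {prompt}"` and a schema that only lists `prompt`, `missing_fields` returns `["strat_key"]`, which is the field to add as required in input.yaml.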

Step 3: Assess Current State

Check what already exists:

ls <dataset_path>/ 2>/dev/null | head -20

Count existing cases and read one to understand the current structure. Note:

How many cases exist
What topics/scenarios they cover
Any obvious gaps (only simple cases? no edge cases? no error scenarios?)

Step 4: Choose Strategy

bootstrap (default) — Generate N cases from scratch. Use this when starting from zero or when fewer than 5 cases exist.

Design cases to cover:

1 simple case — straightforward input, expected to pass all judges easily
1 complex case — longer input, multiple requirements, tests the skill's full capability
1 edge case — unusual input that tests boundaries (very short, very long, ambiguous, missing fields)
Remaining cases — map to the judge-driven requirements from Step 1. Each remaining case should target a specific judge criterion that the first three cases don't already stress. If there are more judge criteria than remaining case slots, prioritize the strictest judges (those with high thresholds or binary pass/fail).
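
The slot allocation described above can be sketched as follows. The judge dictionaries and their `name`, `binary`, and `threshold` keys are hypothetical stand-ins for whatever was extracted from eval.yaml in Step 1, and the ranking (binary judges first, then higher thresholds) is one reading of "strictest".

```python
from typing import Dict, List


def plan_bootstrap(count: int, judge_criteria: List[Dict]) -> List[str]:
    """Label each case slot: three fixed archetypes, then judge-targeted cases."""
    base = ["simple", "complex", "edge"]
    plan = base[:count]
    remaining = max(count - len(base), 0)
    # Strictest first: binary pass/fail judges, then judges with higher thresholds.
    ranked = sorted(
        judge_criteria,
        key=lambda j: (not j.get("binary", False), -j.get("threshold", 0.0)),
    )
    plan += [f"judge:{j['name']}" for j in ranked[:remaining]]
    return plan
```

If there are fewer judge criteria than remaining slots, the plan comes back shorter than `count`; the leftover slots are free for extra input variety.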

expand — Read existing cases, identify gaps, generate cases that fill them. Use this when cases exist but coverage is thin.

Read each existing case's input file to understand what's already covered. Then look for gaps by comparing against:

The skill's documented capabilities (from eval.md)
The judges' criteria (from eval.yaml — what do judges check that no case tests?)
Edge cases mentioned in the skill analysis
Input variety (all cases similar? need different lengths, complexities, topics)

If --run-id was provided, also read the eval results to target empirical failure patterns:

python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<run-id>/summary.yaml

Use the results to prioritize new cases:

If a judge consistently fails across cases, add cases that isolate its specific criterion
If a judge scores low on simple inputs but passes on complex ones (or vice versa), add cases that explore that boundary
If certain case types have no failures, don't add more of the same — focus on the weak spots

Avoid duplicating existing scenarios — each new case should test something distinct that isn't already covered. Number new cases continuing from the highest existing case number.
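
Continuing the numbering can be done by scanning existing directory names. The sketch below assumes the `case-NNN-slug` naming shown in Step 5; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed directory naming: case-<number>-<slug>
CASE_DIR = re.compile(r"^case-(\d+)")


def next_case_number(dataset_path: str) -> int:
    """Return one more than the highest existing case number (1 if none exist)."""
    root = Path(dataset_path)
    if not root.is_dir():
        return 1
    numbers = [
        int(match.group(1))
        for entry in root.iterdir()
        if entry.is_dir() and (match := CASE_DIR.match(entry.name))
    ]
    return max(numbers, default=0) + 1


def case_dir_name(number: int, slug: str) -> str:
    """Format a directory name with a zero-padded, three-digit case number."""
    return f"case-{number:03d}-{slug}"
```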

from-traces — Extract real inputs from MLflow traces and turn them into test cases. Use this when the skill has been used in production and traces are available.

Run the extraction script:

python3 ${CLAUDE_SKILL_DIR}/../eval-mlflow/scripts/from_traces.py \
  --config <config> \
  --count <N>

This outputs YAML with extracted trace inputs (prompt text, tool interactions). Read the output and create case directories following the generation template from Step 2. The trace inputs give you realistic content for the input fields — but you still need to structure the files according to dataset.schema.

If the script exits with code 2 (no traces found) or MLflow is not configured, tell the user and fall back to the expand strategy.
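
A minimal sketch of that fallback is shown below. Only exit code 2 is documented here; treating every other non-zero exit (and a command that cannot be started) as "not configured" is an assumption, as is the `extract_or_fall_back` name.

```python
import subprocess
from typing import List, Tuple

NO_TRACES_EXIT_CODE = 2  # documented above: no traces found


def extract_or_fall_back(cmd: List[str]) -> Tuple[str, str]:
    """Run the extraction command; return (strategy, payload).

    payload is the extracted YAML on success, otherwise a reason to show the user.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # The command could not be started at all.
        return ("expand", f"could not run extraction: {exc}")
    if result.returncode == 0:
        return ("from-traces", result.stdout)
    if result.returncode == NO_TRACES_EXIT_CODE:
        return ("expand", "no traces found")
    # Assumption: any other failure is treated like a missing MLflow setup.
    return ("expand", result.stderr.strip() or f"exit code {result.returncode}")
```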

Step 5: Generate Cases

For each case, create a directory under dataset.path following the structure described in dataset.schema.

Naming: Use descriptive directory names that indicate what the case tests:

case-001-simple-basic-input/
case-002-complex-multi-requirement/
case-003-edge-empty-context/
case-004-long-detailed-input/
case-005-ambiguous-phrasing/

Content: Use the generation template from Step 2. Every case must include all required files and fields. Vary optional fields across cases — include them in some, omit in others. Use the field semantics to generate realistic content appropriate to each field's purpose.

Realism: Cases should look like something a real user would encounter. Don't generate lorem ipsum or obviously synthetic inputs. Use realistic names, scenarios, and domain language appropriate to the skill.

External-state placeholders: For fields marked [EXTERNAL: System] in the schema, use TODO_<SYSTEM>_<FIELD> as the value (e.g., project_key: "TODO_JIRA_PROJECT_KEY"). If you want to show a plausible real value, put it in a YAML comment (e.g., # replace with real key, such as MYPROJECT). The TODO_ prefix signals that this must be replaced with a real value from the target system before execution. List all placeholders in Step 7 so the user knows what needs manual review.
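
The placeholder naming can be made mechanical. The helper below is a hypothetical sketch; the normalization (non-alphanumerics collapsed to underscores, then upper-cased) is an assumption chosen so that the Jira example above comes out as written.

```python
import re


def external_placeholder(system: str, field: str) -> str:
    """Build a TODO_<SYSTEM>_<FIELD> value for an [EXTERNAL: System] field."""
    def normalize(part: str) -> str:
        # Collapse anything that is not a letter or digit into underscores.
        return re.sub(r"[^A-Za-z0-9]+", "_", part).strip("_").upper()

    return f"TODO_{normalize(system)}_{normalize(field)}"


def placeholder_line(system: str, field: str, hint: str = "") -> str:
    """Render one YAML line, with an optional hint kept in a comment."""
    line = f'{field}: "{external_placeholder(system, field)}"'
    return f"{line}  # {hint}" if hint else line
```

For instance, `placeholder_line("Jira", "project_key", "replace with real key, such as MYPROJECT")` yields `project_key: "TODO_JIRA_PROJECT_KEY"  # replace with real key, such as MYPROJECT`.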

Answers, annotations, companion files, and reference outputs: See ${CLAUDE_SKILL_DIR}/references/case-generation.md for answers.yaml (interactive skills), annotations.yaml (outcome-aware judges with if conditions), companion files, and reference output guidance.

Step 6: Validate

After generating, verify the cases:

Read one generated case back and check it matches the schema
Count files per case — do they match what dataset.schema describes?
If execution.mode is case, verify that input.yaml contains all fields referenced by {field} placeholders in execution.arguments
If companion files are expected, verify they exist in each case directory
Check for obvious issues (empty files, placeholder text, wrong field names)
If judges have if conditions referencing annotations, verify that the generated cases cover both branches — at least one case where the condition is true and one where it's false. Warn if any conditional judge would never run (or always run) across the entire dataset.

ls <dataset_path>/case-001-*/
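
The file-level checks can also be scripted. This sketch covers only missing and empty files; `check_cases` is a hypothetical helper, and `required_files` is whatever list Step 2 extracted from dataset.schema.

```python
from pathlib import Path
from typing import Dict, List


def check_cases(dataset_path: str, required_files: List[str]) -> Dict[str, List[str]]:
    """Return {case directory name: [problems]} for cases with missing or empty files."""
    problems: Dict[str, List[str]] = {}
    for case_dir in sorted(Path(dataset_path).glob("case-*")):
        if not case_dir.is_dir():
            continue
        issues: List[str] = []
        for name in required_files:
            target = case_dir / name
            if not target.is_file():
                issues.append(f"missing {name}")
            elif target.stat().st_size == 0:
                issues.append(f"empty {name}")
        if issues:
            problems[case_dir.name] = issues
    return problems
```

An empty result means every case directory has all required files with some content; field names and placeholder text still need a manual read.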

Step 7: Report

Tell the user what was created:

Cases generated: N new cases at <path>
Strategy used: bootstrap / expand / from-traces
Coverage: What scenarios are now covered (simple, complex, edge cases)
What's missing: Reference outputs (if not generated), any gaps still remaining
External-state placeholders: If any TODO_ placeholder values were generated, list each one with which case it's in, which external system it references, and what kind of value is needed (e.g., "case-001/input.yaml TODO_JIRA_PROJECT_KEY — needs a real Jira project key from your test instance"). These MUST be replaced with real values before running /eval-run.
Next steps (include --config <config> if a non-default config was used):
- /eval-run --model <model> to test the skill against these cases
- /eval-run --model <model> --gold to generate gold references from the best outputs
- /eval-dataset --strategy expand --count 10 to add more cases later
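
To build the placeholder list for the report, the generated files can be scanned for `TODO_` values. The sketch below is an assumption-laden helper: it looks only at `*.yaml` files, uses a regular expression rather than a YAML parser, and strips comments naively (a `#` inside a quoted string would also be cut).

```python
import re
from pathlib import Path
from typing import List, Tuple

TODO_VALUE = re.compile(r"\bTODO_[A-Z0-9_]+")


def find_placeholders(dataset_path: str) -> List[Tuple[str, str]]:
    """Return (relative file path, placeholder) pairs found under the dataset."""
    root = Path(dataset_path)
    hits: List[Tuple[str, str]] = []
    for path in sorted(root.rglob("*.yaml")):
        for line in path.read_text(encoding="utf-8").splitlines():
            value_part = line.split("#", 1)[0]  # ignore trailing YAML comments
            for token in TODO_VALUE.findall(value_part):
                hits.append((path.relative_to(root).as_posix(), token))
    return hits
```

Each pair maps directly onto one report line, for example `case-001/input.yaml` with `TODO_JIRA_PROJECT_KEY`.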

Step 8 (if `--harbor`): Emit Harbor task packages

If --harbor was passed, generate self-contained task packages for containerized execution. Run:

python3 ${CLAUDE_SKILL_DIR}/scripts/harbor.py \
  --config <config> --out <dataset_dir>/../harbor-tasks --image <image> \
  [--judge-model <model>] [--verifier-timeout 900] [--agent-timeout 3600]

See ${CLAUDE_SKILL_DIR}/references/case-generation.md for details.

Rules

Match the schema exactly — if dataset.schema says "input.yaml with a 'prompt' field", create input.yaml with a prompt field. Not input.json, not query.yaml.
Realistic over synthetic — cases should feel like real usage, not test scaffolding
Cover the skill's range — don't just generate 5 variations of the same simple input. Test different capabilities the skill claims to have.
Don't fabricate gold outputs — if you're not confident in what a correct output looks like, leave the reference out. Wrong references are worse than no references.
Name cases descriptively — case-003-edge-empty-context is better than case-003. The name should indicate what scenario is being tested.
Start small — 5 well-designed cases beat 50 random ones. Quality over quantity, especially for the first dataset.

$ARGUMENTS