eval-run

name: eval-run
description: Execute skill evaluation against test cases, score with judges, and report results. Requires eval.yaml (generated by /eval-analyze). Use when the user wants to test a skill, run eval, benchmark, compare models, detect regressions, check skill quality, or verify changes didn't break anything. Triggers on "run eval", "test the skill", "evaluate", "benchmark", "check for regressions", "how does my skill perform", "score the skill", "run the tests", "run my evals", "compare against baseline", "did I break anything", "test my changes". Also called by /eval-optimize for automated iterations.
user-invocable: true
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Agent, Skill, AskUserQuestion

You are an evaluation executor. You run a skill against test cases, score the outputs with judges, and report results. You orchestrate by calling scripts — never duplicate their work.

For the full data flow (dataset → workspace → execution → collection → scoring), see ${CLAUDE_SKILL_DIR}/references/data-pipeline.md. For tool interception mechanics, see ${CLAUDE_SKILL_DIR}/references/tool-interception.md.

Step 0: Parse Arguments and Load Config

Parse $ARGUMENTS:

Argument	Required	Default	Description
`--config <path>`	no	auto-discover	Path to eval config
`--model <model>`	no	`models.skill` from config	Skill model. Required if `models.skill` is unset in eval.yaml.
`--subagent-model <model>`	no	`models.subagent` → falls back to skill model	Model for subagents (e.g., `claude-sonnet-4-6` while main is `claude-opus-4-7`)
`--skill <name>`	no	from config	Override the skill to test
`--run-id <id>`	no	`YYYY-MM-DD-<model>`	Identifier for this run
`--cases <id> [<id> ...]`	no	all cases	Exact case IDs to run
`--baseline <run-id>`	no	—	Previous run to compare against
`--no-llm-judges`	no	false	Skip LLM judges (prompt, prompt_file, LLM builtins). Run deterministic judges (check, Python builtins, external code).
`--gold`	no	false	Save outputs as gold references after run
`--effort <level>`	no	`runner.effort` from config	Claude Code reasoning effort (Claude Code only; ignored by other runners)
`--runner <type>`	no	local	`local` (default; Steps 1–8), `harbor` (containerized; skips to the Harbor runner section), or `evalhub` (submits to EvalHub; skips to the EvalHub runner section)
`--env <name>`	no	`kubernetes`	Harbor execution environment: `podman`, `kubernetes`, `openshift` (only with `--runner harbor`)

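If --runner harbor: after config discovery, skip to the Harbor runner section below, where one run.py call replaces Steps 2–6. If --runner evalhub: after config discovery, skip to the EvalHub runner section below, which replaces Steps 2–8.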

Config Discovery

If --config was explicitly provided, use that path directly.

Otherwise, auto-discover eval configs:

python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py

1 config found: auto-select it as <config>
Multiple configs found: present the list and ask the user which to run
No configs found: proceed to the bootstrap flow below

After selecting a config, read its skill field to set <eval-name> (used in $AGENT_EVAL_RUNS_DIR/<eval-name>/<id> paths below).

Check if the resolved config file exists:

test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"

If config is missing: invoke eval-analyze to bootstrap:

Use the Skill tool to invoke /eval-analyze [--skill <skill>]

Once config exists, read it to understand the eval setup — the skill under test, runner, dataset, outputs, judges, models, and any tool interception. The downstream scripts read the same config; you don't need to pass these fields through, just confirm they're present and warn the user about anything missing or surprising.

If inputs.tools has entries and the skill uses AskUserQuestion or external APIs, verify that the handlers cover those tools. Warn the user if a tool the skill uses isn't intercepted, because headless execution may hang waiting on it.

Persist parsed flags:

mkdir -p tmp ${AGENT_EVAL_RUNS_DIR:-eval/runs}
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/eval-config.yaml \
  model=<model> skill=<skill> run_id=<id> baseline=<baseline> \
  gold=<true/false> no_llm_judges=<true/false>
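
As an illustration of how the parsed flags map onto the `key=value` arguments above, here is a small sketch. The helper name `state_init_command` is hypothetical, and two behaviors are assumptions rather than documented ones: unset flags are omitted, and booleans are written as lowercase `true`/`false`.

```python
def state_init_command(state_script: str, state_file: str, flags: dict) -> list[str]:
    """Build the argv for `state.py init <file> key=value ...`."""
    pairs = []
    for key, value in flags.items():
        if value is None:
            continue  # assumption: unset flags are simply left out
        if isinstance(value, bool):
            value = "true" if value else "false"
        pairs.append(f"{key}={value}")
    return ["python3", state_script, "init", state_file, *pairs]
```

Returning an argv list rather than a shell string avoids quoting problems if a run-id or model name contains unusual characters.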

Step 1: Find Dataset

Read dataset.path from eval.yaml. Verify the directory exists and contains at least one case subdirectory:

ls <dataset_path>/ | head -20

If --cases was specified, pass the IDs to workspace.py as --cases <id> <id> ....

If no cases are found, stop and tell the user clearly:

Which path was checked
That test cases must exist there
That they can run /eval-dataset to generate test cases, or /eval-analyze --update to reconfigure the dataset path

Step 2: Preflight Check

Before setting up the workspace, verify the project's artifact directories are clean. Skills write to the project directory (not the workspace), so stale artifacts from previous runs contaminate results — wrong IDs, stale run reports, inflated file counts.

python3 ${CLAUDE_SKILL_DIR}/scripts/preflight.py \
  --config <config> \
  [--run-id <id>] \
  [--baseline <baseline-id>]

The script checks tmp/ state files, whether $AGENT_EVAL_RUNS_DIR/<eval-name>/<id> already has results from a previous run, and (if --baseline is given) that the baseline run-id exists under the same eval-name directory.

If CLEAN: proceed to workspace setup.
If DIRTY: report the findings to the user and ask what to do:
- Force clean: run preflight.py --clean --force to delete all stale artifacts, then proceed.
- Change run-id: append a version suffix (e.g., 2026-04-11-opus-v2) and re-check. This avoids overwriting previous run results but still requires cleaning project artifacts — re-run preflight with --clean and the new run-id.
- Abort: let the user handle cleanup manually.
If MISSING_BASELINE (exit 2): the baseline run-id wasn't found at $AGENT_EVAL_RUNS_DIR/<eval-name>/<baseline>. The script lists nearby run-ids — confirm the correct one with the user (typo, or did they mean a different date/variant?) before retrying.
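
The three outcomes above can be sketched as a small classifier. Only the exit code 2 for MISSING_BASELINE comes from the text; detecting CLEAN and DIRTY by looking for those words at the start of an output line is an assumption about preflight.py, and `classify_preflight` is a hypothetical helper name.

```python
def classify_preflight(returncode: int, stdout: str) -> str:
    """Map a preflight.py result onto CLEAN, DIRTY, MISSING_BASELINE, or UNKNOWN."""
    if returncode == 2:
        return "MISSING_BASELINE"  # documented: exit 2
    for line in stdout.splitlines():
        # Assumption: the status keyword leads one of the output lines.
        first_word = line.strip().split(":")[0].split(" ")[0]
        if first_word in ("CLEAN", "DIRTY"):
            return first_word
    return "UNKNOWN"
```

Treating anything unrecognized as UNKNOWN, rather than as CLEAN, keeps the run from proceeding on output the sketch does not understand.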

Step 3: Prepare Workspace

Create an isolated workspace with the test cases and output directories:

python3 ${CLAUDE_SKILL_DIR}/scripts/workspace.py \
  --config <config> \
  --run-id <id> \
  [--cases <id> [<id> ...]]

The script prints WORKSPACE: <path>, CASES: <count>, BATCH: <path>. Report these to the user. If inputs.tools is configured, it also prints HOOKS: N tool interceptors configured.

If the case count is 0, stop — the filter matched nothing.
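
A minimal sketch of reading those fields out of the script's stdout follows. It assumes the `KEY: value` pairs are separated by newlines or commas and that paths contain no commas; `parse_workspace_output` is a hypothetical helper, not part of workspace.py.

```python
import re

# Matches "WORKSPACE: /path", "CASES: 3", "BATCH: /path", "HOOKS: N tool ..."
_FIELD = re.compile(r"\b(WORKSPACE|CASES|BATCH|HOOKS):\s*([^,\n]+)")


def parse_workspace_output(stdout: str) -> dict:
    """Extract the reported fields; CASES is converted to an int."""
    fields = {key.lower(): value.strip() for key, value in _FIELD.findall(stdout)}
    if "cases" in fields:
        fields["cases"] = int(fields["cases"])
    return fields
```

With the result in hand, a `cases` value of 0 is the signal to stop before execution.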

Step 3b: Resolve Tool Interception (if `inputs.tools` configured)

If eval.yaml has inputs.tools entries, this step is mandatory. workspace.py emits a skeleton in tool_handlers.yaml; you must resolve each handler's prompt into concrete runtime checks (input_filters, env_checks, case_overrides). Do not skip this even when the eval.yaml is unchanged — the workspace is created fresh each time.

Read ${CLAUDE_SKILL_DIR}/references/tool-interception.md for the full format, field reference, and resolution examples. Then read <workspace>/tool_handlers.yaml and for each handler:

Identify type: AskUserQuestion, Bash, or MCP tool
Add required fields: input_filters for Bash, env_checks for services, case_overrides for deterministic answers
Verify: every Bash handler has input_filters — without them the handler is non-functional

Write the resolved handlers back to tool_handlers.yaml.
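
The verification rule for Bash handlers can be sketched as below. The handler layout is an assumption: the keys `name` and `tool` are hypothetical, and only `input_filters` is taken from the text. The authoritative format is in references/tool-interception.md.

```python
def bash_handlers_missing_filters(handlers: list[dict]) -> list[str]:
    """Return the names of Bash handlers that have no input_filters."""
    missing = []
    for handler in handlers:
        # "tool" and "name" are hypothetical keys used for illustration.
        if handler.get("tool") == "Bash" and not handler.get("input_filters"):
            missing.append(handler.get("name", "<unnamed>"))
    return missing
```

An empty result means every Bash handler is functional under this rule; any name returned needs filters before the handlers are written back.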

Step 4: Execute Skill

Run the skill headlessly against test cases. In case mode (default), execute.py runs the skill once per case with case-specific arguments and workspace — each case gets its own stdout.log and subagent transcripts. In batch mode, all cases run in a single invocation via batch.yaml.

If hooks: is configured in eval.yaml, execute.py automatically runs lifecycle hooks at the appropriate points: before_all before any case executes, before_each/after_each around each case execution, and after_all after all cases complete (guaranteed, even on failure). Hook logs are written to $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/hooks/.
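
The hook ordering described above can be sketched with a `try`/`finally`, which is what makes after_all run even when a case fails. This is an illustration of the lifecycle, not execute.py's implementation; `run_with_hooks` is a hypothetical name. The text only guarantees after_all, so the sketch does not run after_each for a case that raises.

```python
from typing import Callable, Iterable


def run_with_hooks(cases: Iterable[str],
                   run_case: Callable[[str], None],
                   hooks: dict[str, Callable[..., None]]) -> None:
    """Run cases with lifecycle hooks; after_all always fires."""
    def call(name: str, *args: str) -> None:
        hook = hooks.get(name)
        if hook is not None:
            hook(*args)

    call("before_all")
    try:
        for case in cases:
            call("before_each", case)
            run_case(case)
            call("after_each", case)
    finally:
        call("after_all")  # guaranteed, even if a case raised
```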

python3 ${CLAUDE_SKILL_DIR}/scripts/execute.py \
  --config <config> \
  --workspace <workspace_path> \
  --skill <skill_name> \
  --skill-args "<skill arguments>" \
  --model <model> \
  --output $AGENT_EVAL_RUNS_DIR/<eval-name>/<id> \
  --run-id <id> \
  [--agent <runner>] \
  [--subagent-model <model>] \
  [--mlflow-experiment <name>] \
  [--effort <level>] \
  [--parallelism <n>]

Launch with run_in_background: true (no pipes). Monitor via tail -20 <output_file>. After completion, check run_result.json — if exit_code is non-zero, report the failure and stop. See ${CLAUDE_SKILL_DIR}/references/execution-monitoring.md for CLI flag fallbacks, monitoring patterns, and problem detection.

Step 5: Collect Artifacts

Distribute workspace outputs into per-case directories so judges can score each case independently:

python3 ${CLAUDE_SKILL_DIR}/scripts/collect.py \
  --config <config> \
  --workspace <workspace_path> \
  --output $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>

Read the collection summary (JSON file — read it with cat or jq, not state.py):

cat $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/collection.json

Report per-case counts. If any case has 0 artifacts, warn — the skill may not have produced output for that case.

Step 6: Score

Run all configured judges against the collected outputs. Four judge types are supported: builtin (reusable from the harness library), inline check (Python snippets), LLM prompt/prompt_file (Jinja2 rendered), and external module/function. All four accept an optional arguments: key for parameterization.

If --no-llm-judges was specified, skip judges that make LLM API calls (prompt, prompt_file, and LLM builtins). Run deterministic judges only (check, Python builtins, external code).

python3 ${CLAUDE_SKILL_DIR}/scripts/score.py judges \
  --run-id <id> \
  --config <config> \
  --workspace <workspace_path> \
  --model <model>

If hooks.before_scoring is configured in eval.yaml, score.py runs those hooks before judge execution. Pass --workspace and --model so hook environment variables are populated.

Judges receive a record dict with:

File contents: outputs["files"], outputs["<dir>_content"]
Execution metadata: outputs["exit_code"], outputs["duration_s"], outputs["cost_usd"], outputs["num_turns"] (if traces.metrics enabled)
Tool calls: outputs["tool_calls"] (if outputs has tool: entries)
Logs: outputs["stdout"], outputs["stderr"] (if traces.stdout/stderr enabled)
Annotations: outputs["annotations"] — parsed annotations.yaml from the dataset case directory (always present, empty dict if no file). Use for outcome-aware scoring where expected results depend on the test case.

This means judges can check output quality, execution efficiency, AND expected outcomes from annotations.
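
As an illustration of outcome-aware scoring, here is a sketch of a deterministic judge that compares the exit code against an annotation. Several details are assumptions: that the fields listed above live under `record["outputs"]`, that the annotation key is `expected_exit_code` (hypothetical), and that a judge returns a `value` and `rationale`.

```python
def check_expected_exit_code(record: dict) -> dict:
    """Pass when the skill's exit code matches the case's annotation."""
    outputs = record["outputs"]
    annotations = outputs.get("annotations", {})  # empty dict when no file
    # "expected_exit_code" is a hypothetical annotation key; default to success.
    expected = annotations.get("expected_exit_code", 0)
    actual = outputs.get("exit_code")
    return {
        "value": actual == expected,
        "rationale": f"exit_code={actual}, expected {expected}",
    }
```

Because annotations default to an empty dict, the same judge works for cases without an annotations.yaml and simply expects a zero exit code there.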

If --baseline was specified, also run pairwise comparison:

python3 ${CLAUDE_SKILL_DIR}/scripts/score.py pairwise \
  --run-id <id> \
  --baseline <baseline_id> \
  --config <config>

Read the full results:

python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/summary.yaml

summary.yaml has three sections: judges (per-judge mean and pass_rate), per_case (per-case {value, rationale} per judge), and pairwise (only if --baseline was used: run_a, run_b, wins_a, wins_b, ties).

Step 7: Interpret and Report

Read the summary and analyze the results. Read ${CLAUDE_SKILL_DIR}/prompts/analyze-results.md for the full analysis framework and output format. Lead with Recommendation (self-contained — many readers only read this section). Save the analysis to $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/analysis.md with YAML frontmatter (agent, model, date) — see the prompt file for the template.

Generate HTML report:

python3 ${CLAUDE_SKILL_DIR}/scripts/report.py \
  --run-id <id> \
  --config <config> \
  [--baseline <baseline_id>] \
  --open

Tell the user the report is at $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/report.html.

If --gold was specified: after scoring, copy the collected artifacts to the dataset case directories as reference files. Report which cases were saved.

Suggest next steps (include --config <config> if a non-default config was used):

/eval-review --run-id <id> for interactive human review of the results
/eval-optimize --model <model> for automated improvement based on failures
/eval-mlflow --run-id <id> to log results to MLflow

Step 8: Log to MLflow (optional)

If mlflow.experiment is configured in eval.yaml:

Use the Skill tool to invoke /eval-mlflow --action log-results --run-id <id> --config <config>

Harbor runner (`--runner harbor`)

When --runner harbor is specified, skip Steps 2–6 and call run.py instead — it handles task generation (or reuse), harbor run, per-case judging (in-container), result mapping, and report generation in one call:

PYTHONPATH="$(pwd)" python3 -m agent_eval.harbor.run \
    --config <config> --model <model> \
    --output $AGENT_EVAL_RUNS_DIR/<eval-name>/<run-id> \
    --tasks-dir <tasks-dir> --jobs-dir <tmp-jobs> \
    [--image <image>] [--agent <agent>] [--n-concurrent N] \
    [--env kubernetes]

Cluster-specific config (namespace, credentials secret) is read from a .env file in the project root. Create it with AGENT_EVAL_K8S_NAMESPACE and AGENT_EVAL_K8S_CREDENTIALS_SECRET — run.py loads it automatically.
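
Before calling run.py, it can help to confirm that the .env file defines both variables. The sketch below is a hand-rolled check, not run.py's loader; it assumes plain `KEY=value` lines and treats both variables as required, which the text implies but does not state.

```python
REQUIRED_KEYS = ("AGENT_EVAL_K8S_NAMESPACE", "AGENT_EVAL_K8S_CREDENTIALS_SECRET")


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> list[str]:
    """Required variables that are absent or empty in the .env text."""
    values = parse_dotenv(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```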

Tasks come from /eval-dataset (which emits Harbor task packages via scripts/harbor.py). If --tasks-dir already has them, run.py reuses them; if empty, it generates on the fly (needs --image). The output is a standard run_result.json + summary.yaml + report.html — then continue with Step 6 pairwise (if --baseline) and Step 8 (MLflow) as normal. See deploy/harbor/README.md for image build, credentials, and environment setup.

EvalHub runner (`--runner evalhub`)

When --runner evalhub is specified, skip Steps 2–8 and call the EvalHub runner instead — it creates ConfigMaps, submits a job to EvalHub, polls for completion, and maps the results back:

python3 -m agent_eval.evalhub.runner \
    --config <config> --model <model> \
    --output $AGENT_EVAL_RUNS_DIR/<eval-name>/<run-id> \
    [--evalhub-url <url>] [--namespace <ns>] [--project-dir <path>]

The runner creates K8s ConfigMaps for the eval config and project resources (via k8s_resources), submits the job to EvalHub (which creates a Job pod running the adapter in-process), polls until completion, and maps the results into the standard summary.yaml + report.html. No image rebuild needed — ConfigMaps carry the project-specific content.

Rules

Never read large artifact files into your context — delegate content analysis to agents. The summary.yaml has everything you need for reporting.
Persist state at every step — use python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py so flags and results survive context compression.
Report progress at each step so the user knows what's happening and how long it's taking.
Fail fast — if execution fails, report it immediately. Don't continue to scoring with no artifacts.
Be decisive in analysis — the user wants to know what's wrong and what to do about it, not a list of possibilities.

$ARGUMENTS