name: run-eval
description: Run and analyze CLISpecBench evals like WordCount, RS274, and IGES. Use when launching evals, detaching background runs, monitoring progress, inspecting transcripts and result JSON, classifying failures, or preparing official results tables and Last Message summaries.
Meta: This file is a living document. If any documentation or guidance below is out of date, or you find a strictly better or more efficient way to do any of it, please update this file.
Run eval
Docker prerequisites
- On Windows, use the WSL2 daemon (this is the repo-supported setup for this host).
- Make sure Docker Engine is started in WSL and reachable from Windows:
    wsl -d Ubuntu -- service docker status
- If needed:
    wsl -d Ubuntu -- sudo service docker start
    DOCKER_HOST=tcp://localhost:2375 docker version
- If the run cannot connect to Docker:
  - restart Docker in WSL:
      wsl -d Ubuntu -- sudo service docker restart
  - then re-run
      DOCKER_HOST=tcp://localhost:2375 docker version
- Proceed with eval commands only after a successful daemon check.
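A minimal sketch of a pre-flight check for the daemon endpoint used above. It only tests that something accepts TCP connections on localhost:2375; it does not prove the Docker daemon is healthy, so it complements rather than replaces `docker version`. The function name `docker_port_reachable` is hypothetical and not part of the harness.

```python
import socket


def docker_port_reachable(host: str = "localhost", port: int = 2375,
                          timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: the WSL2 daemon is exposed at tcp://localhost:2375, as in
    the DOCKER_HOST value used throughout this document.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution errors.
        return False


if __name__ == "__main__":
    if docker_port_reachable():
        print("Port 2375 is open; confirm with the docker version check.")
    else:
        print("Port 2375 is not reachable; start or restart Docker in WSL.")
```

A False result points at the restart steps above; a True result still needs the real daemon check before launching evals.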
Launching runs
- There is no timeout flag. Let agent sessions exit naturally; killing them early depresses scores and wastes compute.
- The harness has a 24-hour safety backstop for hung containers; treat that as an emergency stop, not a normal control.
- Prefer detached runs over blocking the shell. Monitor progress.txt in the eval directory as runs complete, and also check container status when you need to tell whether work is still active.
- Use clispecbench run ... if the console script is available; otherwise use uv run clispecbench run ....
- Use clispecbench results or uv run clispecbench results to inspect aggregate run output after completion.
- On Windows + WSL2 Docker, authenticate claude, codex, and gemini on Windows, not inside WSL. Credential paths under C:\Users\<you>\.claude, .codex, and .gemini are translated to /mnt/c/... for the WSL daemon; scripts/smoke-test-*.sh is the source of truth for the mount strategy.
- Claude Desktop sign-in does not guarantee headless claude --print eval auth. Refresh Claude Code CLI credentials on Windows with /login or claude auth login, then run scripts/smoke-test-claude.sh. For long unattended queues, claude setup-token can generate a one-year CLAUDE_CODE_OAUTH_TOKEN; use it only if the launcher can keep the token out of logs/history.
Example:
DOCKER_HOST=tcp://localhost:2375 uv run clispecbench run --task rs274-cpp --agent codex-cli --model gpt-5.2-codex --effort xhigh --runs 3
- On this Windows host, do not background evals via Start-Process powershell ...; use a hidden cmd.exe wrapper instead.
- Tested detached launcher pattern on this host:
    Start-Process cmd.exe -WindowStyle Hidden -ArgumentList '/d','/c','cd /d C:\Git\CLISpecBench && set DOCKER_HOST=tcp://localhost:2375 && uv run clispecbench run --task <task> --agent <agent> --model <model> [optional flags] 1>"<out.log>" 2>"<err.log>"'
- Do not rely on raw docker from PowerShell on this host; the harness uses the Python Docker SDK.
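The console-script fallback described above can be sketched as a small argv builder. This is an illustration only: `build_run_command` is a hypothetical helper, and it uses just the flags that appear in the example command (--task, --agent, --model, --effort, --runs). It does not handle detaching or DOCKER_HOST.

```python
import shutil
from typing import Callable, List, Optional


def build_run_command(task: str, agent: str, model: str,
                      effort: Optional[str] = None,
                      runs: Optional[int] = None,
                      which: Callable[[str], Optional[str]] = shutil.which
                      ) -> List[str]:
    """Build the argv for a run, preferring the console script.

    Falls back to `uv run clispecbench` when the console script is not on
    PATH. The `which` parameter exists only so the lookup can be stubbed.
    """
    if which("clispecbench"):
        base = ["clispecbench"]
    else:
        base = ["uv", "run", "clispecbench"]
    cmd = base + ["run", "--task", task, "--agent", agent, "--model", model]
    if effort is not None:
        cmd += ["--effort", effort]
    if runs is not None:
        cmd += ["--runs", str(runs)]
    return cmd


if __name__ == "__main__":
    # Mirrors the example command shown earlier in this section.
    print(" ".join(build_run_command("rs274-cpp", "codex-cli",
                                     "gpt-5.2-codex", "xhigh", 3)))
```

Returning a list rather than a joined string keeps quoting decisions with whichever launcher wraps the command.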
Post-run inspection
- After every eval run completes, inspect the result JSON and transcript.jsonl before moving on. Do not batch review later.
- Use the metadata.notes field in the result JSON for root-cause observations when the file is still editable.
- For every run, classify the failure-mode bucket for metadata.exit_class. The bucket determines whether the run counts toward Best/Mean (see Reporting rules). Three top-level groups:
  - completed: the agent ran to its own self-terminated end. Score reflects what the agent built. Always included.
  - Model-side failures (prefix model_*): the run produced a real, scorable submission but the agent CLI exited via something other than its own completion path. Score still reflects model behavior. Included in Best/Mean unless the submission was empty/stub. Sub-buckets:
    - model_timeout: agent was actively working when killed by the 24h backstop or a local timeout. Note whether source files exist and whether they build.
    - model_context_exhausted: agent hit context-window limits; note how far it got.
    - model_no_code: agent completed voluntarily but never wrote source files. Check whether it only planned or analyzed; flag as a false-completion if it claimed it shipped code.
    - model_build_failure: agent wrote source but it does not compile.
    - model_agent_error: agent crashed or threw an unhandled exception.
    - model_output_cap: agent's session ended because a single model response exceeded the per-message output-token ceiling. Diagnostic signal: agent_last_message contains "Claude's response exceeded the NNN output token maximum" (or CLAUDE_CODE_MAX_OUTPUT_TOKENS text). Distinct from infra_usage_cap: this is a model-behavior signal (the agent chose to emit a single huge response instead of chunking or stopping); the user's billing tier has nothing to do with it. Kept and included. NOTE: the claude-code adapter now sets CLAUDE_CODE_MAX_OUTPUT_TOKENS=1000000 (the CLI clamps to each model's true max), removing the CLI's default 32k truncation, so this bucket should now be rare for claude-code. The override was introduced because Opus 4.8 at --effort max hit the old 32k default on every run (dying mid-task with stub submissions). If you still see it, the model genuinely emitted a single response exceeding its own max output.
  - User/environment failures (prefix infra_*): the run was cut off by something outside the model's control. Score does not measure model capability. Always excluded from Best/Mean and flagged for rerun, with no exceptions for high scores or long wall-times: the agent didn't get a fair chance to finish, so the result is not a model-capability data point. Sub-buckets:
    - infra_usage_cap: agent hit an Anthropic/operator-side usage cap, daily quota, or session limit. Diagnostic signal: agent_last_message contains "you've hit your limit" / "resets at X" / similar cap-reset text, OR the agent CLI exited 1 mid-session after long wall-time with no completion claim. This bucket replaces the old model_capped. Even if the agent ran for 30 min, produced a working build, and scored 0.85, this is still scrapped: the cap belongs to the operator's billing tier, not the model, and recording it would inflate the model's failure rate with non-model signal.
    - infra_auth: 401/403 errors or expired credentials; agent never started real work. Fast-fail signature (wall under a minute, no source files written).
    - infra_rate_limit: 429s or platform-side quota exhaustion encountered before any work happened. Same fast-fail signature as infra_auth.
    - infra_other: server errors, capacity errors, container startup crashes, or any other host/network issue not attributable to the model.
- How to distinguish a usage cap from anything else: the agent CLI's "you've hit your limit" string is definitive for infra_usage_cap (account billing, scrap); read metadata.agent_last_message. The string "Claude's response exceeded the 32000 output token maximum" is model_output_cap (model behavior, keep) and must not be confused with the billing cap. Account-cap messages reference a reset time ("resets 5:30pm UTC"); the output-cap message references the CLAUDE_CODE_MAX_OUTPUT_TOKENS env var.
- Mid-session infra failures are NOT model signal even when partial scores look real. An infra_* classification is independent of wall-time, tool-call count, or partial test score. If the agent ran 17 minutes, made 14 tool calls, and the harness extracted a submission scoring 89/220, but the agent's last message is "Failed to authenticate. API Error: 401 ..." or "API Error: Unable to connect to API (ConnectionRefused)", the score reflects WHEN the operator's token expired or the network blipped, not what the model could do. Scrap and rerun. The publish CLI's stoplist (publish._classify_unpublishable_stop_message) catches the known signatures; when a new mid-session infra signature appears, add it there so it can't be hand-published by mistake.
- Silent model fallback (served ≠ requested). The agent CLI can accept a --model it doesn't recognize and silently run its default model instead: no warning, no error, exit 0. This happened with the pinned claude-code CLI (2.1.120) and the deprecated 4.0 snapshots: requesting claude-opus-4-20250514 / claude-sonnet-4-20250514 / claude-opus-4-1-20250805 was served claude-opus-4-7, producing ~86% scores that matched Opus 4.7, not the older model. The harness guards this: the runner records metadata.served_model from the transcript's system/init event and, on a positive mismatch with the requested model, forces exit_reason="error" and writes a MODEL MISMATCH note; publish_result hard-refuses to publish any run where served_model is incompatible with model (see results.models_compatible). Aliases resolving to their dated snapshot (e.g. claude-opus-4-5 → claude-opus-4-5-20251101) are compatible.
- CLI generations under one claude-code agent. Rather than a separate agent id, claude-code spans multiple pinned CLI images via a variant (src/clispecbench/agents/claude_code.py): the default (2.1.x) and a legacy (2.0.2) image that still serves the 4.0-gen snapshots. The variant is chosen by --cli-version (default / legacy or a version string) or, by default, auto-routed by model: the snapshots in _LEGACY_MODELS go to the legacy image automatically. The CLI version is recorded in metadata.agent_version (2.0.2 vs 2.1.120) and surfaced on the dashboard (CC 2.0.2 / O-4.0 vs CC 2.1.120 / O-4.6), so there is no claude-code-legacy agent name. The legacy variant omits --effort (absent in 2.0.x), sets MAX_THINKING_TOKENS=31999 as its max-reasoning setting, and disables OTEL (token/cost come from the stream-json result event). To collect a new deprecated model, add its id to _LEGACY_MODELS; the served-model guard backstops a wrong mapping. Legacy 4.0-gen queues: scripts/_queue-legacy-rs274.sh (MODEL=… LABEL=… [SKIP=N]).
- Record the bucket in metadata.exit_class and a brief prose explanation in metadata.notes (or report explicitly if the file is already finalized).
- For every run that scores above zero, confirm from the transcript whether the agent acknowledged it was done, voluntarily exited, was still working when killed, asked for input but got none, or hit an error/rate limit partway through.
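The message signatures quoted above can be turned into a first-pass triage helper. This is a rough sketch, not the harness's publish._classify_unpublishable_stop_message: the function name `suggest_exit_class` and the regexes are assumptions built only from the strings quoted in this section. Treat its output as a hint to verify against the transcript.

```python
import re
from typing import Optional

# Order matters: the output-cap check runs first so it is never mistaken
# for the billing cap. Patterns are assumptions derived from the quoted text.
_SIGNATURES = [
    ("model_output_cap",
     re.compile(r"exceeded the \d+ output token maximum", re.I)),
    ("model_output_cap", re.compile(r"CLAUDE_CODE_MAX_OUTPUT_TOKENS")),
    ("infra_usage_cap",
     re.compile(r"you[’']ve hit your limit|resets (at )?\d", re.I)),
    ("infra_auth",
     re.compile(r"Failed to authenticate|API Error: 40[13]", re.I)),
    ("infra_other", re.compile(r"Unable to connect to API", re.I)),
]


def suggest_exit_class(last_message: str) -> Optional[str]:
    """Return a candidate exit_class for a last message, or None.

    None means no known signature matched; the run still needs manual
    classification from the transcript and result JSON.
    """
    for bucket, pattern in _SIGNATURES:
        if pattern.search(last_message):
            return bucket
    return None


if __name__ == "__main__":
    print(suggest_exit_class("Failed to authenticate. API Error: 401 ..."))
```

A match on an infra_* bucket means scrap and rerun regardless of score; a None result says nothing about whether the run completed cleanly.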
Reporting rules
- Every results report must include the language/task variant, the failure-mode bucket for every non-completed run, whether any included non-completed runs may have scored higher with more wall time, and any infrastructure-side failures requiring reruns.
- A run is included in official per-run tables and Best/Mean calculations if its exit_class is completed, or if it starts with model_ and test_summary.total > 0 (tests actually ran). All other runs (anything infra_*, plus model-side runs where pytest never collected) are excluded and listed in a separate "Excluded runs" section.
- Annotate every included non-completed run inline (e.g. 0.747† with a footnote: "agent hit model_output_cap at 3,076s; submission scorable"). The point is to make the reader see at a glance which numbers came from clean runs and which came from runs cut short.
- If a cell has zero included runs, show - | - for Best and Mean and explain the failure in the Status column.
- The model preparing or publishing the results is responsible for the Last Message summary: read the full metadata.agent_last_message, compare it against the run outcome and artifacts, and write a concise editorial summary instead of copying the first sentence.
- Do not include file counts or LOC in the Last Message summary unless they are needed to explain a contradiction, false completion claim, or other anomaly; the per-run table reports those fields separately.
- Surface these signals when present:
- claims complete
- incomplete and acknowledged
- incomplete and asked to continue
- incomplete but not acknowledged
- asked a clarifying question
- "acknowledged" and "asked to continue" are not mutually exclusive; report both when both appear.
- Flag false completion claims explicitly. Example: if the message says the simulator is working but the run produced 0 files or only stubs, say so.
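The inclusion rule and the Best/Mean calculation can be sketched as follows. The layout of the result dict is an assumption: metadata.exit_class and test_summary.total are named in this document, but the top-level "score" key and the helper names are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Tuple


def is_included(run: dict) -> bool:
    """Apply the inclusion rule: completed, or model_* with tests run."""
    exit_class = run.get("metadata", {}).get("exit_class", "")
    if exit_class == "completed":
        return True
    tests_ran = run.get("test_summary", {}).get("total", 0) > 0
    return exit_class.startswith("model_") and tests_ran


def best_and_mean(runs: List[dict],
                  score_key: str = "score") -> Optional[Tuple[float, float]]:
    """Return (best, mean) over included runs, or None if none qualify.

    None corresponds to the "- | -" cell in the results table.
    """
    scores = [r[score_key] for r in runs if is_included(r)]
    if not scores:
        return None
    return max(scores), mean(scores)


if __name__ == "__main__":
    runs = [
        {"metadata": {"exit_class": "completed"},
         "test_summary": {"total": 220}, "score": 0.75},
        {"metadata": {"exit_class": "infra_usage_cap"},
         "test_summary": {"total": 220}, "score": 0.85},
    ]
    print(best_and_mean(runs))  # only the completed run counts
```

Note that the infra_usage_cap run is dropped even though its score is the highest, which matches the no-exceptions rule for infra_* buckets.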
Publishing rules
- Pass the bucket through to clispecbench publish so the dashboard surfaces it. The canonical list of valid --status labels lives in src/clispecbench/harness/status.py (VALID_STATUSES); the publish CLI rejects anything not in that set and auto-populates metadata.exit_class from STATUS_TO_EXIT_CLASS. The current mapping is:
  - completed → --status "Complete" (or "Incomplete" if the agent acknowledged gaps)
  - model_timeout → --status "Timeout"
  - model_context_exhausted → --status "Context exhausted"
  - model_no_code → --status "No code written"
  - model_build_failure → --status "Build failure"
  - model_agent_error → --status "Agent error"
  - model_output_cap → --status "Output cap"
- Deprecated labels (e.g. "Capped (model)" from before the model_capped → infra_usage_cap + model_output_cap split) are rejected with a migration hint. See DEPRECATED_STATUS_REPLACEMENTS for the mapping.
- Do not publish infra_* runs, including infra_usage_cap. They contain no model-capability signal; delete the transient artifacts and rerun. This applies even when the agent did substantial work before the cap tripped: a capped run is the operator's billing tier showing up in the data, not the model, and publishing it would mislabel an operator failure as a model failure. The cap-hit transient (result.json, transcript.jsonl, source/) should be removed before relaunch so the harness doesn't accumulate stale eval directories.
- When publishing a model_timeout run, include a Last Message summary that explicitly notes the timeout and the wall-time at which it tripped, so the dashboard reader knows the agent was cut short and didn't choose to stop.
- The dashboard reads published_results/web/results-published.json and published_results/web/test-results-published.json, which are not updated by clispecbench publish itself. After publishing, regenerate them or the dashboard will silently drift behind the published files. Two ways:
  - One-shot publish: pass --rebuild-dashboard to clispecbench publish and the rebuild runs immediately.
  - Batch publish (multiple results in a row): omit the flag on each publish and run clispecbench rebuild-dashboard once after the whole batch. The rebuild walks every published file, so doing it per-publish is wasteful.
- Commit both results-published.json and test-results-published.json alongside the new runN.json files. They're regenerated artifacts but checked in so consumers don't have to rebuild before opening the dashboard.
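For quick reference, the status mapping listed in this section can be mirrored in a small lookup. This is an illustrative copy only: the canonical data lives in src/clispecbench/harness/status.py, and the names `STATUS_TO_EXIT_CLASS_SKETCH` and `status_for` are hypothetical and may not match the real module.

```python
# Illustrative mirror of the mapping listed above; not the canonical table.
STATUS_TO_EXIT_CLASS_SKETCH = {
    "Complete": "completed",
    "Incomplete": "completed",
    "Timeout": "model_timeout",
    "Context exhausted": "model_context_exhausted",
    "No code written": "model_no_code",
    "Build failure": "model_build_failure",
    "Agent error": "model_agent_error",
    "Output cap": "model_output_cap",
}


def status_for(exit_class: str, acknowledged_gaps: bool = False) -> str:
    """Return the --status label for a publishable exit_class.

    Raises ValueError for infra_* buckets (never published) and for
    buckets missing from the sketch mapping.
    """
    if exit_class.startswith("infra_"):
        raise ValueError(f"{exit_class} runs are not published; rerun instead")
    if exit_class == "completed":
        return "Incomplete" if acknowledged_gaps else "Complete"
    for status, bucket in STATUS_TO_EXIT_CLASS_SKETCH.items():
        if bucket == exit_class:
            return status
    raise ValueError(f"no status label known for {exit_class}")


if __name__ == "__main__":
    print(status_for("model_timeout"))
```

If the sketch and status.py ever disagree, status.py wins; the publish CLI validates against VALID_STATUSES regardless.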