name: run-benchmark
description: Configure and launch CodeScaleBench runs with current paired-run and curation guardrails.
Skill: Run Benchmark
Scope
Use this skill when the user asks to run benchmark suites, rerun failures, or launch official/gap-fill batches in CodeScaleBench.
Approval Gate (Required Before Running)
Before executing any benchmark run, ask the user to confirm:
- Model: which model? (e.g. anthropic/claude-haiku-4-5-20251001 for test runs)
- Suite / selection file: which benchmark suite or --selection-file?
- Config: paired (default), --baseline-only, or --full-only? Which --full-config?
- Parallel slots: how many? (default: 1; use 8 for multi-account runs)
- Category: staging (default) or official?
Do NOT launch a run until the user has confirmed these five parameters.
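The gate above can be sketched as a small checklist. This is a minimal illustration, not part of CodeScaleBench: the RunRequest class, its field names, and the unconfirmed helper are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RunRequest:
    """Hypothetical container for the five parameters the user must confirm."""
    model: Optional[str] = None
    suite_or_selection_file: Optional[str] = None
    config: Optional[str] = None          # "paired", "baseline-only", or "full-only"
    parallel_slots: Optional[int] = None
    category: Optional[str] = None        # "staging" or "official"


def unconfirmed(request: RunRequest) -> List[str]:
    """Return the names of parameters that are still unset (None)."""
    return [f.name for f in fields(request) if getattr(request, f.name) is None]


# Example: three of the five parameters have been confirmed so far.
request = RunRequest(
    model="anthropic/claude-haiku-4-5-20251001",
    config="paired",
    category="staging",
)
missing = unconfirmed(request)
if missing:
    print("Ask the user to confirm:", ", ".join(missing))
```

Defaults such as 1 parallel slot or the staging category are deliberately not pre-filled here, so a default still counts as unconfirmed until the user accepts it.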
Canonical Commands (Current)
- Per-suite default: ./configs/<suite>_2config.sh
- Unified selected-task runner: ./configs/run_selected_tasks.sh
- Config registry: configs/eval_matrix.json
- Do not assume *_3config.sh runners exist.
Run Policy (Mandatory)
- Default execution is paired by task: baseline + sourcegraph_full.
- Single-lane runs are gap-fill only:
  - --baseline-only requires valid existing sourcegraph_full counterpart runs.
  - --full-only requires valid existing baseline counterpart runs.
- Emergency bypass only: ALLOW_UNPAIRED_SINGLE_CONFIG=true.
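The pairing rule can be read as a simple predicate. The sketch below only illustrates the policy as stated; the function name, the valid_counterparts argument, and the exact-match check on the environment variable are assumptions, and the real guard in the runner scripts may behave differently.

```python
import os

# Which counterpart config each single-lane flag depends on (from the policy above).
REQUIRED_COUNTERPART = {
    "--baseline-only": "sourcegraph_full",
    "--full-only": "baseline",
}


def single_lane_allowed(flag, valid_counterparts, env=None):
    """Decide whether a single-lane run for one task may proceed.

    flag: "--baseline-only" or "--full-only".
    valid_counterparts: set of config names that already have valid runs for the task.
    env: mapping of environment variables (defaults to os.environ).
    """
    env = os.environ if env is None else env
    # Emergency bypass; assumed here to require the literal value "true".
    if env.get("ALLOW_UNPAIRED_SINGLE_CONFIG") == "true":
        return True
    return REQUIRED_COUNTERPART[flag] in valid_counterparts


# A baseline-only gap-fill is allowed only if a valid sourcegraph_full run exists.
print(single_lane_allowed("--baseline-only", {"sourcegraph_full"}, env={}))
```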
Standard Launch Patterns
# Paired per-suite run
./configs/pytorch_2config.sh --parallel 4
# Paired selected-task run
./configs/run_selected_tasks.sh --benchmark csb_sdlc_pytorch
# Gap-fill baseline only (guarded)
./configs/run_selected_tasks.sh --benchmark csb_sdlc_pytorch --baseline-only
# Gap-fill full only (guarded)
./configs/run_selected_tasks.sh --benchmark csb_sdlc_pytorch --full-only
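If the launch command is assembled programmatically, the selected-task patterns above could be built as follows. This is a sketch under assumptions: the helper name and the lane labels are hypothetical, and only the flags shown in the patterns above are used.

```python
import shlex

# Map a lane choice to the extra flag shown in the launch patterns above.
LANE_FLAGS = {
    "paired": [],
    "baseline-only": ["--baseline-only"],
    "full-only": ["--full-only"],
}


def selected_tasks_command(benchmark, lane="paired"):
    """Build the argument list for ./configs/run_selected_tasks.sh."""
    if lane not in LANE_FLAGS:
        raise ValueError(f"unknown lane: {lane!r}")
    return ["./configs/run_selected_tasks.sh", "--benchmark", benchmark, *LANE_FLAGS[lane]]


# Print the command for review instead of executing it, so the user can confirm first.
print(shlex.join(selected_tasks_command("csb_sdlc_pytorch", "baseline-only")))
```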
Preflight Checks
Before launching:
python3 scripts/check_infra.py
- Ensure SOURCEGRAPH_ACCESS_TOKEN is set for MCP-enabled lanes.
- Ensure Claude credentials exist in ~/.claude/.credentials.json (or configured multi-account homes).
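The two manual checks can also be scripted. The sketch below is an assumption-laden illustration rather than a replacement for scripts/check_infra.py: the function name and its parameters are hypothetical, and it checks only the default credentials path, not multi-account homes.

```python
import os
from pathlib import Path


def preflight_problems(env=None, home=None, mcp_enabled=True):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    problems = []
    # The token matters only for MCP-enabled lanes.
    if mcp_enabled and not env.get("SOURCEGRAPH_ACCESS_TOKEN"):
        problems.append("SOURCEGRAPH_ACCESS_TOKEN is not set")
    credentials = home / ".claude" / ".credentials.json"
    if not credentials.is_file():
        problems.append(f"Claude credentials not found at {credentials}")
    return problems


for problem in preflight_problems():
    print("preflight:", problem)
```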
Post-Run Curation (Before Analysis)
python3 scripts/quarantine_invalid_tasks.py --execute
python3 scripts/generate_manifest.py --require-triage --fail-on-unknown-prefix
python3 scripts/validate_official_integrity.py --runs-dir runs/official --check-mcp-trace-health
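The three curation commands are meant to run in this order before any analysis. One way to chain them is sketched below; it assumes each script signals failure through a non-zero exit code, which the source does not confirm, and the run_curation helper is hypothetical.

```python
import subprocess

# The curation commands from above, in the order they should run.
CURATION_STEPS = [
    ["python3", "scripts/quarantine_invalid_tasks.py", "--execute"],
    ["python3", "scripts/generate_manifest.py", "--require-triage", "--fail-on-unknown-prefix"],
    ["python3", "scripts/validate_official_integrity.py",
     "--runs-dir", "runs/official", "--check-mcp-trace-health"],
]


def run_curation(run=subprocess.run):
    """Run each step in order and stop at the first failure.

    Returns the failing command, or None if every step exited with code 0.
    The run parameter exists so a stub can be injected for dry runs.
    """
    for cmd in CURATION_STEPS:
        result = run(cmd)
        if result.returncode != 0:
            return cmd
    return None
```

Calling run_curation() from the repository root would execute the real scripts; a non-None return value names the step to investigate before moving on to analysis.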
Analysis Entrypoints
python3 scripts/audit_traces.py --json
python3 scripts/cost_report.py
python3 scripts/generate_eval_report.py --runs-dir runs/official --output-dir eval_reports
Output Expectations
Runs land under runs/<category>/<run_dir>/ with per-config task directories and flagged_tasks.json.
Use runs/official/MANIFEST.json as the authoritative curated inventory.
Failure Handling
- If a single-lane run is blocked by guardrails, run paired mode unless this is confirmed gap-fill.
- If counterpart validity is unclear, regenerate and validate manifest first.