name: benchmark-drilldown
description: Use this skill when the user asks why a specific benchmark run failed for one model/question slice. phrases: "why did model X fail question Y", "drill down into this benchmark failure", "show me failure details for q6", "what happened in direct-question-06 for claude opus", "plot validation failure root cause"
benchmark-drilldown
Use this skill when the user asks why a specific benchmark run failed for one model/question slice.
Trigger phrases
- "why did model X fail question Y"
- "drill down into this benchmark failure"
- "show me failure details for q6"
- "what happened in direct-question-06 for claude opus"
- "plot validation failure root cause"
Inputs expected
- results_dir: folder containing results.csv and run artifacts.
- model: exact model string used in the run (from results.csv).
- question: single index (6) or inclusive range (6-9).
- Optional: failure_kind (runtime, validation, all).
- Optional: output format (markdown, table, yaml, json).
Exact command workflow
- Sanity-check required files
test -f <results_dir>/results.csv
- Locate optional catalogs (profile-local first, then repo-root fallback)
test -f <results_dir>/validation_errors.yaml || test -f validation_errors.yaml
test -f <results_dir>/error-catalog.yaml || test -f error-catalog.yaml
- (Optional) discover model names and question coverage
python - <<'PY'
import pandas as pd
from pathlib import Path
p = Path('<results_dir>/results.csv')
df = pd.read_csv(p)
print('models=', sorted(df['model'].dropna().unique().tolist()))
print('questions=', sorted(df['question'].dropna().unique().tolist()))
PY
- Run drilldown for one slice
python results/query-drilldown.py \
--results-dir <results_dir> \
--model "<model>" \
--question <question|range> \
--failure-kind <runtime|validation|all> \
--format <markdown|table|yaml|json>
- If failure_kind includes validation, summarize validation catalog
python results/query-validation-errors.py report \
--input <results_dir>/validation_errors.yaml \
--model "<model>" \
--question <question> \
--profile <profile_name>
If <results_dir>/validation_errors.yaml is missing, retry with --input validation_errors.yaml.
- If failure_kind includes runtime, summarize runtime catalog
python results/query-extract-failures.py report \
--input <results_dir>/error-catalog.yaml \
--model "<model>" \
--question <question> \
--profile <profile_name>
If <results_dir>/error-catalog.yaml is missing, retry with --input error-catalog.yaml.
- If catalogs are absent or stale, drill directly into per-trial artifacts
python - <<'PY'
import glob, yaml
paths = sorted(glob.glob('<results_dir>/*_iteration_status.yaml'))
print('found', len(paths), 'iteration status files')
for p in paths[:3]:
    # Open each status file in a context manager so it is closed promptly.
    with open(p, encoding='utf-8') as f:
        d = yaml.safe_load(f)
    print(p, 'validated=', d.get('validated'), 'msg=', d.get('validation_message', ''))
PY
- If catalogs are absent or stale, regenerate them from iteration files (real LLM only)
python results/query-extract-failures.py analyze \
--output <results_dir>/error-catalog.yaml \
<results_dir>/*_iteration_status.yaml
python results/query-validation-errors.py analyze \
--output <results_dir>/validation_errors.yaml \
<results_dir>/*_iteration_status.yaml
Do not use --dummy-llm for real runs. Validation analysis must use the real LLM.
If OpenAI credentials are missing, stop and ask the user to configure the key before continuing.
- (Optional) compare generated code to correction/reference code for validation failures
results/query-validation-errors.py loads correction code from results/questions/Question XX.ipynb.
For question 16, correction code source is results/questions/Question 16.ipynb.
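The "profile-local first, then repo-root fallback" lookup used in the workflow above can be sketched as a small helper. This is an illustrative sketch only: the function name resolve_catalog and its repo_root parameter are assumptions, not part of the repository's scripts.

```python
from pathlib import Path


def resolve_catalog(results_dir, filename, repo_root="."):
    """Return the first existing catalog path, or None if neither exists.

    Hypothetical helper: checks <results_dir>/<filename> first, then falls
    back to <repo_root>/<filename>, mirroring the two `test -f` checks.
    """
    candidates = [Path(results_dir) / filename, Path(repo_root) / filename]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Example lookup for both catalogs named in the workflow.
    for name in ("validation_errors.yaml", "error-catalog.yaml"):
        found = resolve_catalog("results/atlas-sx-awk-hist", name)
        print(name, "->", found if found else "missing (fall back to per-trial artifacts)")
```

A None result corresponds to the "catalogs are absent" branch, where the per-trial *_iteration_status.yaml files are inspected or the catalogs are regenerated.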
Argument patterns
- Question token pattern: ^\d+$ or ^\d+-\d+$.
- Model argument must exactly match one value in results.csv:model.
- results_dir should be a direct path like results/atlas-sx-awk-hist.
- Use --profile only when the catalogs contain multiple profiles.
- If model filtering returns no rows, retry likely aliases (e.g. gpt-5.2-codex and openai/gpt-5.2-codex).
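To make the question token pattern concrete, here is a minimal sketch that validates a token and expands it into the question numbers it covers. The function name parse_question_token is hypothetical, and rejecting descending ranges such as 9-6 is an assumption; the real CLI may treat them differently.

```python
import re

# Single index ("6") or inclusive range ("6-9"), matching the two documented patterns.
_QUESTION_TOKEN = re.compile(r"(\d+)(?:-(\d+))?")


def parse_question_token(token):
    """Expand a question token into a list of question numbers (hypothetical helper)."""
    match = _QUESTION_TOKEN.fullmatch(token)
    if match is None:
        raise ValueError(f"invalid question token: {token!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) is not None else start
    if end < start:
        # Assumption: a descending range is treated as a user error.
        raise ValueError(f"descending range: {token!r}")
    return list(range(start, end + 1))


if __name__ == "__main__":
    print(parse_question_token("6"))
    print(parse_question_token("6-9"))
```

Checking the token before invoking the CLI lets a malformed value such as q6 be reported immediately rather than surfacing as an empty slice.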
Output contract
Follow .codex/skills/_shared/templates/output-contract.md and include:
- Bullet summary of likely root cause(s).
- A failure table with question/model/failure_kind/count/signature.
- 1-3 key snippets from produced markdown/yaml/json outputs.
- Explicit rerun command(s) to validate a fix.
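As an illustration of the failure table, the sketch below groups failure records by question, model, failure kind, and signature, then counts each group. The record layout (a list of dicts with these keys) and the function name failure_table are assumptions; the authoritative layout is whatever the shared output-contract template specifies.

```python
from collections import Counter


def failure_table(records):
    """Render grouped failure counts as a pipe-delimited table (hypothetical helper)."""
    counts = Counter(
        (r["question"], r["model"], r["failure_kind"], r["signature"])
        for r in records
    )
    lines = [
        "| question | model | failure_kind | count | signature |",
        "| --- | --- | --- | --- | --- |",
    ]
    for (question, model, kind, signature), count in sorted(counts.items()):
        lines.append(f"| {question} | {model} | {kind} | {count} | {signature} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up records, purely to show the shape of the output.
    sample = [
        {"question": 6, "model": "claude-opus-4.1", "failure_kind": "validation",
         "signature": "example-signature"},
        {"question": 6, "model": "claude-opus-4.1", "failure_kind": "validation",
         "signature": "example-signature"},
    ]
    print(failure_table(sample))
```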
Failure handling
- Missing results.csv: stop and report exact missing path.
- Missing validation_errors.yaml or error-catalog.yaml under <results_dir>: retry root-level catalog (validation_errors.yaml / error-catalog.yaml) before reducing confidence.
- No rows for model/question slice: report empty slice and print nearest available model/question values.
- No rows after exact model match: retry model aliases with provider prefix/suffix variants.
- Missing pandas during optional discovery: skip discovery, run direct CLI commands, and state fallback.
- Missing yaml dependency in reporting scripts: still run query-drilldown.py output and note report step skipped.
- Missing OpenAI key during validation analyze: stop immediately and ask the user to fix credentials; do not switch to --dummy-llm for real runs.
- Validation phase appears empty in drilldown output: inspect per-trial validated and validation_message fields directly from *_iteration_status.yaml files.
- rg unavailable on Windows: use Get-ChildItem and Select-String as command-line fallback.
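The two model-related fallbacks above (alias retry, then nearest available values) could be combined roughly as follows. This is a sketch under stated assumptions: match_model is a hypothetical name, aliases are assumed to differ only by a provider prefix before the first "/", and difflib similarity stands in for "nearest"; none of this is confirmed behavior of the repository's scripts.

```python
import difflib


def match_model(requested, available):
    """Return (matches, suggestions) for a requested model name (hypothetical helper)."""
    # 1) Exact match against the values found in results.csv.
    if requested in available:
        return [requested], []
    # 2) Alias match: compare names with any provider prefix ("openai/") stripped.
    bare = requested.split("/", 1)[-1]
    aliases = [m for m in available if m.split("/", 1)[-1] == bare]
    if aliases:
        return aliases, []
    # 3) Nothing matched: suggest the closest available names to report back.
    suggestions = difflib.get_close_matches(requested, available, n=3, cutoff=0.6)
    return [], suggestions


if __name__ == "__main__":
    models = ["openai/gpt-5.2-codex", "claude-opus-4.1", "gpt-5-mini"]
    print(match_model("gpt-5.2-codex", models))
    print(match_model("claude-opus-4", models))
```

When the first element is empty, the slice should be reported as empty along with the suggestions, rather than silently substituting a different model.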
Example target cases
1) Claude Opus failing Q6 plot validation
python results/query-drilldown.py \
--results-dir results/atlas-sx-awk-hist \
--model "claude-opus-4.1" \
--question 6 \
--failure-kind validation \
--format markdown
python results/query-validation-errors.py report \
--input results/atlas-sx-awk-hist/validation_errors.yaml \
--model "claude-opus-4.1" \
--question 6 \
--profile atlas-sx-awk-hist
2) Coding failures for Q9 (single model)
python results/query-drilldown.py \
--results-dir results/atlas-sx-awk-hist \
--model "gpt-5-mini" \
--question 9 \
--failure-kind runtime \
--format table
python results/query-extract-failures.py report \
--input results/atlas-sx-awk-hist/error-catalog.yaml \
--model "gpt-5-mini" \
--question 9 \
--profile atlas-sx-awk-hist
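Both cases above follow the same shape, so a rerun command can be assembled programmatically. The sketch below uses only the flags shown in this document; the function name drilldown_command and the defaults (all, markdown) are assumptions made for illustration.

```python
import shlex

FAILURE_KINDS = ("runtime", "validation", "all")
FORMATS = ("markdown", "table", "yaml", "json")


def drilldown_command(results_dir, model, question, failure_kind="all", fmt="markdown"):
    """Build a shell-safe query-drilldown.py command line (hypothetical helper)."""
    if failure_kind not in FAILURE_KINDS:
        raise ValueError(f"unknown failure kind: {failure_kind!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    argv = [
        "python", "results/query-drilldown.py",
        "--results-dir", results_dir,
        "--model", model,
        "--question", str(question),
        "--failure-kind", failure_kind,
        "--format", fmt,
    ]
    # shlex.join quotes any argument containing spaces or shell metacharacters.
    return shlex.join(argv)


if __name__ == "__main__":
    print(drilldown_command("results/atlas-sx-awk-hist", "claude-opus-4.1", 6, "validation"))
```

Quoting through shlex.join keeps model strings with spaces or special characters intact when the command is pasted into a POSIX shell.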