benchmark-drilldown

star 1

Use this skill when the user asks *why a specific benchmark run failed* for one model/question slice. phrases: "why did model X fail question Y", "drill down into this benchmark failure", "show me failure details for q6", "what happened in direct-question-06 for claude opus", "plot validation failure root cause"

gordonwatts By gordonwatts schedule Updated 3/24/2026

name: benchmark-drilldown description: Use this skill when the user asks why a specific benchmark run failed for one model/question slice. phrases: "why did model X fail question Y", "drill down into this benchmark failure", "show me failure details for q6", "what happened in direct-question-06 for claude opus", "plot validation failure root cause"

benchmark-drilldown

Use this skill when the user asks why a specific benchmark run failed for one model/question slice.

Trigger phrases

  • "why did model X fail question Y"
  • "drill down into this benchmark failure"
  • "show me failure details for q6"
  • "what happened in direct-question-06 for claude opus"
  • "plot validation failure root cause"

Inputs expected

  • results_dir: folder containing results.csv and run artifacts.
  • model: exact model string used in the run (from results.csv).
  • question: single index (6) or inclusive range (6-9).
  • Optional: failure_kind (runtime, validation, all).
  • Optional: output format (markdown, table, yaml, json).

Exact command workflow

  1. Sanity-check required files
test -f <results_dir>/results.csv
  1. Locate optional catalogs (profile-local first, then repo-root fallback)
test -f <results_dir>/validation_errors.yaml || test -f validation_errors.yaml
test -f <results_dir>/error-catalog.yaml || test -f error-catalog.yaml
  1. (Optional) discover model names and question coverage
python - <<'PY'
import pandas as pd
from pathlib import Path
p = Path('<results_dir>/results.csv')
df = pd.read_csv(p)
print('models=', sorted(df['model'].dropna().unique().tolist()))
print('questions=', sorted(df['question'].dropna().unique().tolist()))
PY
  1. Run drilldown for one slice
python results/query-drilldown.py \
  --results-dir <results_dir> \
  --model "<model>" \
  --question <question|range> \
  --failure-kind <runtime|validation|all> \
  --format <markdown|table|yaml|json>
  1. If failure_kind includes validation, summarize validation catalog
python results/query-validation-errors.py report \
  --input <results_dir>/validation_errors.yaml \
  --model "<model>" \
  --question <question> \
  --profile <profile_name>

If <results_dir>/validation_errors.yaml is missing, retry with --input validation_errors.yaml.

  1. If failure_kind includes runtime, summarize runtime catalog
python results/query-extract-failures.py report \
  --input <results_dir>/error-catalog.yaml \
  --model "<model>" \
  --question <question> \
  --profile <profile_name>

If <results_dir>/error-catalog.yaml is missing, retry with --input error-catalog.yaml.

  1. If catalogs are absent or stale, drill directly into per-trial artifacts
python - <<'PY'
import glob, yaml
paths = sorted(glob.glob('<results_dir>/*_iteration_status.yaml'))
print('found', len(paths), 'iteration status files')
for p in paths[:3]:
    d = yaml.safe_load(open(p, encoding='utf-8'))
    print(p, 'validated=', d.get('validated'), 'msg=', d.get('validation_message', ''))
PY
  1. If catalogs are absent or stale, regenerate them from iteration files (real LLM only)
python results/query-extract-failures.py analyze \
  --output <results_dir>/error-catalog.yaml \
  <results_dir>/*_iteration_status.yaml

python results/query-validation-errors.py analyze \
  --output <results_dir>/validation_errors.yaml \
  <results_dir>/*_iteration_status.yaml

Do not use --dummy-llm for real runs. Validation analysis must use the real LLM. If OpenAI credentials are missing, stop and ask the user to configure the key before continuing.

  1. (Optional) compare generated code to correction/reference code for validation failures

results/query-validation-errors.py loads correction code from results/questions/Question XX.ipynb. For question 16, correction code source is results/questions/Question 16.ipynb.

Argument patterns

  • Question token pattern: ^\d+$ or ^\d+-\d+$.
  • Model argument must exactly match one value in results.csv:model.
  • results_dir should be a direct path like results/atlas-sx-awk-hist.
  • Use --profile only when the catalogs contain multiple profiles.
  • If model filtering returns no rows, retry likely aliases (e.g. gpt-5.2-codex and openai/gpt-5.2-codex).

Output contract

Follow .codex/skills/_shared/templates/output-contract.md and include:

  • Bullet summary of likely root cause(s).
  • A failure table with question/model/failure_kind/count/signature.
  • 1-3 key snippets from produced markdown/yaml/json outputs.
  • Explicit rerun command(s) to validate a fix.

Failure handling

  • Missing results.csv: stop and report exact missing path.
  • Missing validation_errors.yaml or error-catalog.yaml under <results_dir>: retry root-level catalog (validation_errors.yaml / error-catalog.yaml) before reducing confidence.
  • No rows for model/question slice: report empty slice and print nearest available model/question values.
  • No rows after exact model match: retry model aliases with provider prefix/suffix variants.
  • Missing pandas during optional discovery: skip discovery, run direct CLI commands, and state fallback.
  • Missing yaml dependency in reporting scripts: still run query-drilldown.py output and note report step skipped.
  • Missing OpenAI key during validation analyze: stop immediately and ask the user to fix credentials; do not switch to --dummy-llm for real runs.
  • Validation phase appears empty in drilldown output: inspect per-trial validated and validation_message fields directly from *_iteration_status.yaml files.
  • rg unavailable on Windows: use Get-ChildItem and Select-String as command-line fallback.

Example target cases

1) Claude Opus failing Q6 plot validation

python results/query-drilldown.py \
  --results-dir results/atlas-sx-awk-hist \
  --model "claude-opus-4.1" \
  --question 6 \
  --failure-kind validation \
  --format markdown

python results/query-validation-errors.py report \
  --input results/atlas-sx-awk-hist/validation_errors.yaml \
  --model "claude-opus-4.1" \
  --question 6 \
  --profile atlas-sx-awk-hist

2) Coding failures for Q9 (single model)

python results/query-drilldown.py \
  --results-dir results/atlas-sx-awk-hist \
  --model "gpt-5-mini" \
  --question 9 \
  --failure-kind runtime \
  --format table

python results/query-extract-failures.py report \
  --input results/atlas-sx-awk-hist/error-catalog.yaml \
  --model "gpt-5-mini" \
  --question 9 \
  --profile atlas-sx-awk-hist
Install via CLI
npx skills add https://github.com/gordonwatts/hep-data-llm --skill benchmark-drilldown
Repository Details
star Stars 1
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator