benchmark-failure-classifier

star 1

Use this skill when the user asks to *classify recurring benchmark failures* (runtime or validation), often across models/questions. phrases: "classify these benchmark failures", "group recurring errors", "what failure classes are most common", "coding failures for q9", "summarize validation failures for q6"

gordonwatts By gordonwatts schedule Updated 3/24/2026

name: benchmark-failure-classifier description: Use this skill when the user asks to classify recurring benchmark failures (runtime or validation), often across models/questions. phrases: "classify these benchmark failures", "group recurring errors", "what failure classes are most common", "coding failures for q9", "summarize validation failures for q6"

benchmark-failure-classifier

Use this skill when the user asks to classify recurring benchmark failures (runtime or validation), often across models/questions.

Trigger phrases

  • "classify these benchmark failures"
  • "group recurring errors"
  • "what failure classes are most common"
  • "coding failures for q9"
  • "summarize validation failures for q6"

Inputs expected

  • results_dir: profile output directory under results/.
  • Optional filters: model, question, profile.
  • Optional mode: runtime (error-catalog.yaml) vs validation (validation_errors.yaml) vs both.

Exact command workflow

  1. Check catalog presence
test -f <results_dir>/error-catalog.yaml || echo "missing runtime catalog"
test -f <results_dir>/validation_errors.yaml || echo "missing validation catalog"
test -f error-catalog.yaml || echo "missing root runtime catalog"
test -f validation_errors.yaml || echo "missing root validation catalog"
  1. Runtime classification summary (no LLM needed)
python results/query-extract-failures.py report \
  --input <results_dir>/error-catalog.yaml \
  [--model "<model>"] \
  [--question <question>] \
  [--profile <profile>]

If <results_dir>/error-catalog.yaml is missing, retry with --input error-catalog.yaml.

  1. Validation classification summary (no LLM needed)
python results/query-validation-errors.py report \
  --input <results_dir>/validation_errors.yaml \
  [--model "<model>"] \
  [--question <question>] \
  [--profile <profile>]

If <results_dir>/validation_errors.yaml is missing, retry with --input validation_errors.yaml.

  1. (Optional) regenerate catalogs from per-run files when absent/stale
python results/query-extract-failures.py analyze \
  --output <results_dir>/error-catalog.yaml \
  <results_dir>/*_iteration_status.yaml

python results/query-validation-errors.py analyze \
  --output <results_dir>/validation_errors.yaml \
  <results_dir>/*_iteration_status.yaml

Do not use --dummy-llm for real runs. Validation analysis must use the real LLM. If OpenAI credentials are missing, stop and ask the user to configure the key before continuing.

  1. (Optional) inspect per-trial artifacts when catalogs are absent/stale
python - <<'PY'
import glob, yaml
paths = sorted(glob.glob('<results_dir>/*_iteration_status.yaml'))
print('found', len(paths), 'iteration status files')
for p in paths[:3]:
    d = yaml.safe_load(open(p, encoding='utf-8'))
    print(p, 'validated=', d.get('validated'), 'msg=', d.get('validation_message', ''))
PY
  1. (Optional) drill into one hot-spot slice for evidence snippets
python results/query-drilldown.py \
  --results-dir <results_dir> \
  --model "<model>" \
  --question <question> \
  --failure-kind <runtime|validation> \
  --format markdown

Argument patterns

  • --question accepts numeric index (9) or text substring in report mode.
  • --model should match model names used in result artifacts.
  • analyze commands need a list of YAML run files (glob accepted by shell).
  • If model filtering returns no rows, retry likely aliases (e.g. gpt-5.2-codex and openai/gpt-5.2-codex).

Output contract

Follow .codex/skills/_shared/templates/output-contract.md. Minimum deliverable:

  • Bullet summary (top classes + frequencies + impacted models).
  • Table of classes and counts.
  • Key snippets showing representative signatures.
  • Suggested remediation by class (prompting, code scaffold, data-shape checks, validation tolerances).

Failure handling

  • Missing error-catalog.yaml / validation_errors.yaml: attempt analyze regeneration using *_iteration_status.yaml files.
  • No iteration YAML files found: report inability to regenerate and list expected file pattern.
  • Missing yaml package: report dependency gap and provide pip install pyyaml remediation.
  • Missing pandas package: not required for report flows; only needed for optional custom aggregation.
  • Missing OpenAI key during validation analyze: stop immediately and ask the user to fix credentials; do not switch to --dummy-llm for real runs.
  • Validation phase appears empty in drilldown output: inspect per-trial validated and validation_message fields directly from *_iteration_status.yaml files.
  • rg unavailable on Windows: use Get-ChildItem and Select-String as command-line fallback.

Example target cases

1) Claude Opus failing Q6 plot validation

python results/query-validation-errors.py report \
  --input results/atlas-sx-awk-hist/validation_errors.yaml \
  --model "claude-opus-4.1" \
  --question 6 \
  --profile atlas-sx-awk-hist

python results/query-drilldown.py \
  --results-dir results/atlas-sx-awk-hist \
  --model "claude-opus-4.1" \
  --question 6 \
  --failure-kind validation \
  --format markdown

2) Coding failures for Q9

python results/query-extract-failures.py report \
  --input results/atlas-sx-awk-hist/error-catalog.yaml \
  --question 9 \
  --profile atlas-sx-awk-hist

# Optional per-model drilldown:
python results/query-drilldown.py \
  --results-dir results/atlas-sx-awk-hist \
  --model "gpt-5-mini" \
  --question 9 \
  --failure-kind runtime \
  --format table
Install via CLI
npx skills add https://github.com/gordonwatts/hep-data-llm --skill benchmark-failure-classifier
Repository Details
star Stars 1
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator