name: benchmark-failure-classifier
description: Use this skill when the user asks to classify recurring benchmark failures (runtime or validation), often across models/questions.
phrases: "classify these benchmark failures", "group recurring errors", "what failure classes are most common", "coding failures for q9", "summarize validation failures for q6"
benchmark-failure-classifier
Use this skill when the user asks to classify recurring benchmark failures (runtime or validation), often across models/questions.
Trigger phrases
- "classify these benchmark failures"
- "group recurring errors"
- "what failure classes are most common"
- "coding failures for q9"
- "summarize validation failures for q6"
Inputs expected
- results_dir: profile output directory under results/.
- Optional filters: model, question, profile.
- Optional mode: runtime (error-catalog.yaml) vs validation (validation_errors.yaml) vs both.
Exact command workflow
- Check catalog presence
test -f <results_dir>/error-catalog.yaml || echo "missing runtime catalog"
test -f <results_dir>/validation_errors.yaml || echo "missing validation catalog"
test -f error-catalog.yaml || echo "missing root runtime catalog"
test -f validation_errors.yaml || echo "missing root validation catalog"
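The presence checks above, together with the root-level fallback used in later steps, can be sketched as one lookup. This is a minimal sketch, not part of the skill's scripts: `resolve_catalog` and its `root` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Optional


def resolve_catalog(results_dir: str, filename: str, root: str = ".") -> Optional[Path]:
    """Return the first existing catalog path, preferring the profile directory.

    Hypothetical helper: checks <results_dir>/<filename> first, then <root>/<filename>.
    """
    for candidate in (Path(results_dir) / filename, Path(root) / filename):
        if candidate.is_file():
            return candidate
    # Neither location has the catalog; the caller decides whether to regenerate it.
    return None
```

A `None` result corresponds to the case where both `test -f` checks for that catalog print their "missing" message.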
- Runtime classification summary (no LLM needed)
python results/query-extract-failures.py report \
--input <results_dir>/error-catalog.yaml \
[--model "<model>"] \
[--question <question>] \
[--profile <profile>]
If <results_dir>/error-catalog.yaml is missing, retry with --input error-catalog.yaml.
- Validation classification summary (no LLM needed)
python results/query-validation-errors.py report \
--input <results_dir>/validation_errors.yaml \
[--model "<model>"] \
[--question <question>] \
[--profile <profile>]
If <results_dir>/validation_errors.yaml is missing, retry with --input validation_errors.yaml.
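When the report commands are driven from Python rather than typed by hand, the bracketed optional filters can be added only when a value is supplied. The sketch below assumes nothing beyond the `report`, `--input`, `--model`, `--question` and `--profile` arguments shown above; `build_report_cmd` is a hypothetical helper name.

```python
import sys
from typing import List, Optional, Union


def build_report_cmd(script: str, input_path: str,
                     model: Optional[str] = None,
                     question: Optional[Union[int, str]] = None,
                     profile: Optional[str] = None) -> List[str]:
    """Build an argv list for a `report` call, adding only the filters that were given."""
    cmd = [sys.executable, script, "report", "--input", input_path]
    for flag, value in (("--model", model), ("--question", question), ("--profile", profile)):
        if value is not None:
            cmd += [flag, str(value)]  # argv entries must be strings
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with model names.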
- (Optional) regenerate catalogs from per-run files when absent/stale
python results/query-extract-failures.py analyze \
--output <results_dir>/error-catalog.yaml \
<results_dir>/*_iteration_status.yaml
python results/query-validation-errors.py analyze \
--output <results_dir>/validation_errors.yaml \
<results_dir>/*_iteration_status.yaml
Do not use --dummy-llm for real runs. Validation analysis must use the real LLM.
If OpenAI credentials are missing, stop and ask the user to configure the key before continuing.
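The stop-on-missing-credentials rule can be enforced with a guard that runs before any validation `analyze` call. This is a sketch under one assumption: that the key is read from the `OPENAI_API_KEY` environment variable, which is the OpenAI SDK default but is not confirmed for these scripts. `require_openai_key` is a hypothetical name.

```python
import os


def require_openai_key(env=os.environ, var="OPENAI_API_KEY"):
    """Raise instead of falling back to --dummy-llm when no key is configured.

    Assumption: the key lives in the environment variable named by `var`.
    """
    if not env.get(var, "").strip():
        raise RuntimeError(
            f"{var} is not set; configure it before running validation analyze."
        )
```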
- (Optional) inspect per-trial artifacts when catalogs are absent/stale
python - <<'PY'
import glob, yaml
paths = sorted(glob.glob('<results_dir>/*_iteration_status.yaml'))
print('found', len(paths), 'iteration status files')
for p in paths[:3]:
    # Close each file promptly and tolerate empty YAML documents.
    with open(p, encoding='utf-8') as f:
        d = yaml.safe_load(f) or {}
    print(p, 'validated=', d.get('validated'), 'msg=', d.get('validation_message', ''))
PY
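Beyond printing the first few files, the same two fields can be tallied into rough failure classes. The sketch below assumes each loaded file is a mapping with the `validated` and `validation_message` keys used above. The `tally_validation` name and the `<no message>` placeholder are illustrative only, and a trial that failed before validation may also show up as not validated.

```python
from collections import Counter


def tally_validation(records):
    """Count validation messages among trials whose `validated` field is not truthy."""
    counts = Counter()
    for rec in records:
        # Skip empty documents and trials that passed validation.
        if not rec or rec.get("validated"):
            continue
        counts[rec.get("validation_message") or "<no message>"] += 1
    return counts
```

Calling `tally_validation(...).most_common(5)` on the loaded documents gives a quick view of the most frequent messages.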
- (Optional) drill into one hot-spot slice for evidence snippets
python results/query-drilldown.py \
--results-dir <results_dir> \
--model "<model>" \
--question <question> \
--failure-kind <runtime|validation> \
--format markdown
Argument patterns
- --question accepts numeric index (9) or text substring in report mode.
- --model should match model names used in result artifacts.
- analyze commands need a list of YAML run files (glob accepted by shell).
- If model filtering returns no rows, retry likely aliases (e.g. gpt-5.2-codex and openai/gpt-5.2-codex).
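The alias retry can be made systematic by generating the bare and provider-prefixed forms of a model name and trying each in turn. This is a sketch only: `model_aliases` is a hypothetical helper, and the default provider list is an assumption based on the single example above.

```python
def model_aliases(name, providers=("openai",)):
    """Return the name as given, then its bare and provider-prefixed variants."""
    bare = name.split("/", 1)[-1]  # drop a leading "provider/" if present
    aliases = []
    for candidate in [name, bare] + [f"{p}/{bare}" for p in providers]:
        if candidate not in aliases:  # keep order, drop duplicates
            aliases.append(candidate)
    return aliases
```

Each candidate would be passed as `--model` until the report returns rows.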
Output contract
Follow .codex/skills/_shared/templates/output-contract.md.
Minimum deliverable:
- Bullet summary (top classes + frequencies + impacted models).
- Table of classes and counts.
- Key snippets showing representative signatures.
- Suggested remediation by class (prompting, code scaffold, data-shape checks, validation tolerances).
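The table of classes and counts can be rendered from simple (class, model) pairs. The catalog schema is not described here, so the input shape below is an assumption, and `class_table` is a hypothetical helper rather than part of the shared output contract.

```python
from collections import Counter


def class_table(rows):
    """Render (failure_class, model) pairs as a markdown table sorted by count."""
    rows = list(rows)
    counts = Counter(cls for cls, _ in rows)
    models = {}
    for cls, model in rows:
        models.setdefault(cls, set()).add(model)
    lines = ["| Class | Count | Models |", "| --- | --- | --- |"]
    for cls, n in counts.most_common():  # most frequent class first
        lines.append(f"| {cls} | {n} | {', '.join(sorted(models[cls]))} |")
    return "\n".join(lines)
```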
Failure handling
- Missing error-catalog.yaml/validation_errors.yaml: attempt analyze regeneration using *_iteration_status.yaml files.
- No iteration YAML files found: report inability to regenerate and list expected file pattern.
- Missing yaml package: report dependency gap and provide pip install pyyaml remediation.
- Missing pandas package: not required for report flows; only needed for optional custom aggregation.
- Missing OpenAI key during validation analyze: stop immediately and ask the user to fix credentials; do not switch to --dummy-llm for real runs.
- Validation phase appears empty in drilldown output: inspect per-trial validated and validation_message fields directly from *_iteration_status.yaml files.
- rg unavailable on Windows: use Get-ChildItem and Select-String as command-line fallback.
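The dependency gaps listed above can be detected before running any script, without importing the packages. This is a minimal sketch; `missing_packages` is a hypothetical helper.

```python
import importlib.util


def missing_packages(required=("yaml",)):
    """Return the import names from `required` that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]
```

The import name `yaml` is provided by the `pyyaml` distribution, so a result of `["yaml"]` maps to the `pip install pyyaml` remediation. `pandas` would only be added to `required` for custom aggregation.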
Example target cases
1) Claude Opus failing Q6 plot validation
python results/query-validation-errors.py report \
--input results/atlas-sx-awk-hist/validation_errors.yaml \
--model "claude-opus-4.1" \
--question 6 \
--profile atlas-sx-awk-hist
python results/query-drilldown.py \
--results-dir results/atlas-sx-awk-hist \
--model "claude-opus-4.1" \
--question 6 \
--failure-kind validation \
--format markdown
2) Coding failures for Q9
python results/query-extract-failures.py report \
--input results/atlas-sx-awk-hist/error-catalog.yaml \
--question 9 \
--profile atlas-sx-awk-hist
# Optional per-model drilldown:
python results/query-drilldown.py \
--results-dir results/atlas-sx-awk-hist \
--model "gpt-5-mini" \
--question 9 \
--failure-kind runtime \
--format table