output-eval-validate-judge - SKILL.md Agent Skill

name: output-eval-validate-judge description: Validate LLM judges against human labels using TPR/TNR metrics and train/dev/test splits. Use after writing a judge prompt to verify it agrees with human judgment. allowed-tools: [Bash, Read, Write, Edit]

Validating LLM Judges

Overview

An LLM judge is only useful if it agrees with human judgment. This skill walks you through calibrating a judge against human-labeled data using True Positive Rate (TPR) and True Negative Rate (TNR) metrics. Do this before trusting any judgeVerdict(), judgeScore(), or judgeLabel() evaluator in your eval suite.

Prerequisites

A judge .prompt file — Written following output-eval-judge-prompt
~100 human-labeled traces — With binary pass/fail labels for the failure mode this judge targets. Aim for ~50 pass and ~50 fail. Minimum: 20 pass and 20 fail.
Labels stored in dataset YAML — Each dataset has ground_truth.evals.<evaluator_name>.verdict: pass or fail

This process applies only to LLM-based judges. For code-based Verdict.* evaluators, write unit tests instead.

Step 1: Create Data Splits

Split your labeled datasets into three groups:

Split	% of Data	Purpose	Example (100 datasets)
Train	10-20%	Source of few-shot examples in the judge prompt	15 datasets
Dev	40-45%	Iterate on judge prompt, measure TPR/TNR	42 datasets
Test	40-45%	Final held-out measurement, run once	43 datasets

Organizing splits

Use a naming convention or subdirectories to separate splits:

Option A: Name prefixes

tests/datasets/
├── train_formal_pass_01.yml
├── train_casual_fail_01.yml
├── dev_technical_pass_01.yml
├── dev_ambiguous_fail_01.yml
├── test_simple_pass_01.yml
├── test_contradictory_fail_01.yml
└── ...

Option B: Subdirectories

tests/datasets/
├── train/
│   ├── formal_pass_01.yml
│   └── casual_fail_01.yml
├── dev/
│   ├── technical_pass_01.yml
│   └── ambiguous_fail_01.yml
└── test/
    ├── simple_pass_01.yml
    └── contradictory_fail_01.yml

Splitting rules

Balance pass/fail in each split — Don't put all failures in dev and all passes in test
Randomize — Don't sort by difficulty or topic
Training examples in the prompt — Use only train-split examples as few-shot in the judge .prompt file. Never use dev or test examples — that's data leakage
Lock the test split — Once created, do not look at test data until final measurement

Step 2: Run the Judge on Dev Set

Execute the eval workflow against only the dev-split datasets:

# Run with cached output on dev datasets
npx output workflow test <workflowName> --cached \
  --dataset dev_technical_pass_01,dev_ambiguous_fail_01,dev_formal_pass_02,...

Or if using subdirectories, list the dev dataset names:

npx output workflow test <workflowName> --cached \
  --dataset $(ls tests/datasets/dev/ | sed 's/.yml//' | tr '\n' ',')

Save the output. You need the judge's verdict for each dataset to compare against ground truth.

Extracting results

Use --json to get machine-readable results:

npx output workflow test <workflowName> --cached --dataset <dev_datasets> --json

The output includes per-dataset, per-evaluator verdicts that you can compare against ground_truth.evals.<evaluator_name>.verdict.

Step 3: Compute TPR and TNR

For the evaluator you're validating, build a confusion matrix from the dev results.

Definitions

Using "fail" as the positive class (what you're trying to detect):

	Judge says Fail	Judge says Pass
Human says Fail	True Positive (TP)	False Negative (FN)
Human says Pass	False Positive (FP)	True Negative (TN)

TPR (True Positive Rate) = TP / (TP + FN)

"Of all the real failures, what fraction did the judge catch?"
Low TPR means the judge misses real failures (dangerous)

TNR (True Negative Rate) = TN / (TN + FP)

"Of all the real passes, what fraction did the judge correctly approve?"
Low TNR means the judge flags passing traces as failures (noisy)

Example computation

Dev set results for check_tone evaluator (42 datasets):

	Judge: Fail	Judge: Pass
Human: Fail	18 (TP)	3 (FN)
Human: Pass	2 (FP)	19 (TN)

TPR = 18 / (18 + 3) = 85.7%
TNR = 19 / (19 + 2) = 90.5%

Why not raw accuracy?

Raw accuracy = (TP + TN) / total = (18 + 19) / 42 = 88.1%

This looks fine, but masks problems. If your dataset were 90% pass (class imbalance), a judge that always says "pass" would get 90% accuracy while catching zero failures (TPR = 0%). TPR and TNR measure what actually matters: catching failures and not crying wolf.

Step 4: Inspect Disagreements

For every case where the judge disagrees with the human label, determine the root cause.

False Negatives (judge missed a real failure)

The judge said "pass" but the human said "fail." For each:

Read the trace and the judge's critique
Determine why the judge missed it:
- Criterion too narrow — The prompt defines failure too narrowly. Broaden the fail definition.
- Missing few-shot example — The failure pattern isn't represented in examples. Add a similar borderline example from the train split.
- Insufficient context — The judge doesn't have the information needed to detect this failure. Add the missing variable to the prompt.

False Positives (judge flagged a passing trace)

The judge said "fail" but the human said "pass." For each:

Read the trace and the judge's critique
Determine why the judge flagged it:
- Criterion too broad — The prompt defines failure too broadly. Tighten the fail definition.
- Misleading few-shot example — A borderline example is being overgeneralized. Clarify or replace it.
- Overly strict — The judge applies the criterion more strictly than intended. Add explicit exceptions to the prompt.

Logging disagreements

Track each disagreement to guide prompt iteration:

Dataset	Human	Judge	Root Cause	Fix
dev_technical_pass_03	pass	fail	Judge flagged "it's" as casual but context was a direct quote	Add exception: "Contractions within direct quotes are acceptable"
dev_ambiguous_fail_02	fail	pass	Judge missed subtle tone shift in paragraph 3	Add borderline few-shot example showing mid-text tone drift

Step 5: Iterate

Apply the fixes from Step 4 to the judge .prompt file. Then re-run on the dev set:

npx output workflow test <workflowName> --cached --dataset <dev_datasets>

Recompute TPR and TNR. Repeat until both metrics meet the target.

Targets

Metric	Target	Minimum Acceptable
TPR	> 90%	> 80%
TNR	> 90%	> 80%

If you can't reach 80%/80% after 3-4 iterations:

Upgrade the model — Switch from Haiku to Sonnet in the .prompt frontmatter
Split the criterion — The failure mode may contain two distinct sub-failures that need separate judges
Revisit the labels — Some human labels may be inconsistent. Re-label disagreements with a second reviewer

Iteration checklist

Each iteration:

Identified root cause for each disagreement
Applied targeted fix to .prompt file (not random changes)
Re-ran on dev set
Recomputed TPR and TNR
Logged the iteration number, changes made, and resulting metrics

Step 6: Final Measurement on Test Set

Once dev metrics meet the target, run the judge on the held-out test set exactly once:

npx output workflow test <workflowName> --cached --dataset <test_datasets> --json

Compute TPR and TNR on the test results. Record these as the final metrics.

Interpreting final results

Test metrics close to dev metrics — The judge generalizes well. Ship it.
Test metrics significantly lower — The judge may be overfit to dev set patterns. Do not iterate on the test set. Instead, gather more labeled data, re-split, and restart from Step 2.

Recording results

Document the final validation results alongside the judge prompt:

# Validation: check_tone (judge_tone@v1.prompt)
# Date: 2026-03-25
# Model: claude-haiku-4-5-20251001

## Dev Set (42 datasets)
- TPR: 90.5% (19/21)
- TNR: 95.2% (20/21)

## Test Set (43 datasets)
- TPR: 88.0% (22/25)
- TNR: 94.4% (17/18)

## Conclusion: APPROVED — both metrics above 80% minimum

Store this in a VALIDATION.md file next to the judge prompt or in the evaluator's documentation.

Anti-Patterns

Assuming judges work without validation — An unvalidated judge may consistently miss failures or flag passing traces
Using dev/test examples as few-shot — Data leakage inflates metrics and hides real performance
Optimizing for raw accuracy — Use TPR and TNR instead; accuracy hides class imbalance problems
Iterating on the test set — Test is held-out. If test metrics are bad, gather more data and re-split
Skipping disagreement analysis — Random prompt tweaks without understanding root causes don't converge
Too few labeled examples — Below 40 total (20 pass + 20 fail), metrics are unreliable due to small sample size

Related Skills

output-eval-judge-prompt — Design the judge prompt being validated
output-eval-error-analysis — Source of human-labeled data for validation
output-eval-dataset-design — Generate additional labeled datasets if you need more data
output-dev-eval-testing — output workflow test CLI, --cached and --dataset flags
output-eval-audit — Audit whether existing judges have been validated