output-eval-validate-judge

star 422

Validate LLM judges against human labels using TPR/TNR metrics and train/dev/test splits. Use after writing a judge prompt to verify it agrees with human judgment.

growthxai By growthxai schedule Updated 4/8/2026

name: output-eval-validate-judge description: Validate LLM judges against human labels using TPR/TNR metrics and train/dev/test splits. Use after writing a judge prompt to verify it agrees with human judgment. allowed-tools: [Bash, Read, Write, Edit]

Validating LLM Judges

Overview

An LLM judge is only useful if it agrees with human judgment. This skill walks you through calibrating a judge against human-labeled data using True Positive Rate (TPR) and True Negative Rate (TNR) metrics. Do this before trusting any judgeVerdict(), judgeScore(), or judgeLabel() evaluator in your eval suite.

Prerequisites

  1. A judge .prompt file — Written following output-eval-judge-prompt
  2. ~100 human-labeled traces — With binary pass/fail labels for the failure mode this judge targets. Aim for ~50 pass and ~50 fail. Minimum: 20 pass and 20 fail.
  3. Labels stored in dataset YAML — Each dataset has ground_truth.evals.<evaluator_name>.verdict: pass or fail

This process applies only to LLM-based judges. For code-based Verdict.* evaluators, write unit tests instead.

Step 1: Create Data Splits

Split your labeled datasets into three groups:

Split % of Data Purpose Example (100 datasets)
Train 10-20% Source of few-shot examples in the judge prompt 15 datasets
Dev 40-45% Iterate on judge prompt, measure TPR/TNR 42 datasets
Test 40-45% Final held-out measurement, run once 43 datasets

Organizing splits

Use a naming convention or subdirectories to separate splits:

Option A: Name prefixes

tests/datasets/
├── train_formal_pass_01.yml
├── train_casual_fail_01.yml
├── dev_technical_pass_01.yml
├── dev_ambiguous_fail_01.yml
├── test_simple_pass_01.yml
├── test_contradictory_fail_01.yml
└── ...

Option B: Subdirectories

tests/datasets/
├── train/
│   ├── formal_pass_01.yml
│   └── casual_fail_01.yml
├── dev/
│   ├── technical_pass_01.yml
│   └── ambiguous_fail_01.yml
└── test/
    ├── simple_pass_01.yml
    └── contradictory_fail_01.yml

Splitting rules

  • Balance pass/fail in each split — Don't put all failures in dev and all passes in test
  • Randomize — Don't sort by difficulty or topic
  • Training examples in the prompt — Use only train-split examples as few-shot in the judge .prompt file. Never use dev or test examples — that's data leakage
  • Lock the test split — Once created, do not look at test data until final measurement

Step 2: Run the Judge on Dev Set

Execute the eval workflow against only the dev-split datasets:

# Run with cached output on dev datasets
npx output workflow test <workflowName> --cached \
  --dataset dev_technical_pass_01,dev_ambiguous_fail_01,dev_formal_pass_02,...

Or if using subdirectories, list the dev dataset names:

npx output workflow test <workflowName> --cached \
  --dataset $(ls tests/datasets/dev/ | sed 's/.yml//' | tr '\n' ',')

Save the output. You need the judge's verdict for each dataset to compare against ground truth.

Extracting results

Use --json to get machine-readable results:

npx output workflow test <workflowName> --cached --dataset <dev_datasets> --json

The output includes per-dataset, per-evaluator verdicts that you can compare against ground_truth.evals.<evaluator_name>.verdict.

Step 3: Compute TPR and TNR

For the evaluator you're validating, build a confusion matrix from the dev results.

Definitions

Using "fail" as the positive class (what you're trying to detect):

Judge says Fail Judge says Pass
Human says Fail True Positive (TP) False Negative (FN)
Human says Pass False Positive (FP) True Negative (TN)

TPR (True Positive Rate) = TP / (TP + FN)

  • "Of all the real failures, what fraction did the judge catch?"
  • Low TPR means the judge misses real failures (dangerous)

TNR (True Negative Rate) = TN / (TN + FP)

  • "Of all the real passes, what fraction did the judge correctly approve?"
  • Low TNR means the judge flags passing traces as failures (noisy)

Example computation

Dev set results for check_tone evaluator (42 datasets):

Judge: Fail Judge: Pass
Human: Fail 18 (TP) 3 (FN)
Human: Pass 2 (FP) 19 (TN)
  • TPR = 18 / (18 + 3) = 85.7%
  • TNR = 19 / (19 + 2) = 90.5%

Why not raw accuracy?

Raw accuracy = (TP + TN) / total = (18 + 19) / 42 = 88.1%

This looks fine, but masks problems. If your dataset were 90% pass (class imbalance), a judge that always says "pass" would get 90% accuracy while catching zero failures (TPR = 0%). TPR and TNR measure what actually matters: catching failures and not crying wolf.

Step 4: Inspect Disagreements

For every case where the judge disagrees with the human label, determine the root cause.

False Negatives (judge missed a real failure)

The judge said "pass" but the human said "fail." For each:

  1. Read the trace and the judge's critique
  2. Determine why the judge missed it:
    • Criterion too narrow — The prompt defines failure too narrowly. Broaden the fail definition.
    • Missing few-shot example — The failure pattern isn't represented in examples. Add a similar borderline example from the train split.
    • Insufficient context — The judge doesn't have the information needed to detect this failure. Add the missing variable to the prompt.

False Positives (judge flagged a passing trace)

The judge said "fail" but the human said "pass." For each:

  1. Read the trace and the judge's critique
  2. Determine why the judge flagged it:
    • Criterion too broad — The prompt defines failure too broadly. Tighten the fail definition.
    • Misleading few-shot example — A borderline example is being overgeneralized. Clarify or replace it.
    • Overly strict — The judge applies the criterion more strictly than intended. Add explicit exceptions to the prompt.

Logging disagreements

Track each disagreement to guide prompt iteration:

Dataset Human Judge Root Cause Fix
dev_technical_pass_03 pass fail Judge flagged "it's" as casual but context was a direct quote Add exception: "Contractions within direct quotes are acceptable"
dev_ambiguous_fail_02 fail pass Judge missed subtle tone shift in paragraph 3 Add borderline few-shot example showing mid-text tone drift

Step 5: Iterate

Apply the fixes from Step 4 to the judge .prompt file. Then re-run on the dev set:

npx output workflow test <workflowName> --cached --dataset <dev_datasets>

Recompute TPR and TNR. Repeat until both metrics meet the target.

Targets

Metric Target Minimum Acceptable
TPR > 90% > 80%
TNR > 90% > 80%

If you can't reach 80%/80% after 3-4 iterations:

  1. Upgrade the model — Switch from Haiku to Sonnet in the .prompt frontmatter
  2. Split the criterion — The failure mode may contain two distinct sub-failures that need separate judges
  3. Revisit the labels — Some human labels may be inconsistent. Re-label disagreements with a second reviewer

Iteration checklist

Each iteration:

  • Identified root cause for each disagreement
  • Applied targeted fix to .prompt file (not random changes)
  • Re-ran on dev set
  • Recomputed TPR and TNR
  • Logged the iteration number, changes made, and resulting metrics

Step 6: Final Measurement on Test Set

Once dev metrics meet the target, run the judge on the held-out test set exactly once:

npx output workflow test <workflowName> --cached --dataset <test_datasets> --json

Compute TPR and TNR on the test results. Record these as the final metrics.

Interpreting final results

  • Test metrics close to dev metrics — The judge generalizes well. Ship it.
  • Test metrics significantly lower — The judge may be overfit to dev set patterns. Do not iterate on the test set. Instead, gather more labeled data, re-split, and restart from Step 2.

Recording results

Document the final validation results alongside the judge prompt:

# Validation: check_tone (judge_tone@v1.prompt)
# Date: 2026-03-25
# Model: claude-haiku-4-5-20251001

## Dev Set (42 datasets)
- TPR: 90.5% (19/21)
- TNR: 95.2% (20/21)

## Test Set (43 datasets)
- TPR: 88.0% (22/25)
- TNR: 94.4% (17/18)

## Conclusion: APPROVED — both metrics above 80% minimum

Store this in a VALIDATION.md file next to the judge prompt or in the evaluator's documentation.

Anti-Patterns

  • Assuming judges work without validation — An unvalidated judge may consistently miss failures or flag passing traces
  • Using dev/test examples as few-shot — Data leakage inflates metrics and hides real performance
  • Optimizing for raw accuracy — Use TPR and TNR instead; accuracy hides class imbalance problems
  • Iterating on the test set — Test is held-out. If test metrics are bad, gather more data and re-split
  • Skipping disagreement analysis — Random prompt tweaks without understanding root causes don't converge
  • Too few labeled examples — Below 40 total (20 pass + 20 fail), metrics are unreliable due to small sample size

Related Skills

  • output-eval-judge-prompt — Design the judge prompt being validated
  • output-eval-error-analysis — Source of human-labeled data for validation
  • output-eval-dataset-design — Generate additional labeled datasets if you need more data
  • output-dev-eval-testingoutput workflow test CLI, --cached and --dataset flags
  • output-eval-audit — Audit whether existing judges have been validated
Install via CLI
npx skills add https://github.com/growthxai/output --skill output-eval-validate-judge
Repository Details
star Stars 422
call_split Forks 12
navigation Branch main
article Path SKILL.md
More from Creator