name: noisy-but-valid-robust
description: >
  Statistically certify LLM safety/quality using imperfect LLM judges with guaranteed Type-I error control. Implements the "Noisy but Valid" hypothesis testing framework: calibrate a judge's TPR/FPR on a small human-labeled set, then run a variance-corrected test on a large judge-labeled dataset. Use when: "certify my model's failure rate", "validate LLM safety with an LLM judge", "statistical test with noisy labels", "is my model below the safety threshold", "evaluate LLM with imperfect judge", "calibrate judge accuracy and run hypothesis test".
Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges
This skill enables Claude to design and implement statistically rigorous certification pipelines for LLMs when the evaluator (judge) is itself an imperfect LLM. Rather than naively counting judge-flagged failures, the method calibrates the judge's True Positive Rate (TPR) and False Positive Rate (FPR) on a small human-labeled set, then applies a variance-corrected hypothesis test to a large judge-labeled dataset. The result: a valid statistical guarantee that the model's true failure rate is below a chosen safety threshold, even though the judge makes mistakes.
When to Use
- When the user wants to certify that an LLM's failure rate is below a safety threshold (e.g., toxicity < 5%) using another LLM as judge
- When the user has a small set of human-labeled ground truth and a large set of judge-labeled data and needs a statistically valid test
- When the user asks how to account for judge noise, bias, or errors in an LLM evaluation pipeline
- When the user needs to decide whether their judge is good enough to yield more statistical power than direct human evaluation alone
- When the user wants to quantify the cost of not knowing judge parameters perfectly (the Oracle Gap)
- When the user is building an automated red-teaming or safety certification system and needs confidence intervals that hold under judge imperfection
- When the user asks to compare Prediction-Powered Inference (PPI) vs. explicit judge modeling for LLM evaluation
Key Technique
The core problem: You want to test H0: R_M >= alpha (the model's failure rate is at or above the threshold) against H1: R_M < alpha (the model is safe). Human labels on a small sample give low statistical power. An LLM judge scales to a large sample, but its errors bias the observed failure rate, so a naive test on judge labels no longer controls the Type-I error.
The solution (two-stage procedure): First, use a small human-labeled calibration set (n_M ~ 50-200 samples) where both human and judge labels are available to estimate the judge's TPR (probability it correctly flags a true failure) and FPR (probability it incorrectly flags a true success). These transform the safety threshold alpha into a judge-space threshold: alpha' = FPR + (TPR - FPR) * alpha. Second, compute the judge's observed failure rate on a large judge-only dataset (n_J ~ 1,000-10,000+) and compare it to a variance-corrected critical value that accounts for three sources of uncertainty: (1) sampling variance from the judge-labeled set, (2) TPR estimation error from limited calibration failures, and (3) FPR estimation error from limited calibration successes. If the observed judge failure rate falls below this critical value, reject H0 and certify the model.
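The threshold transformation follows from the law of total probability: the judge flags an item either by catching a true failure (probability TPR) or by raising a false alarm on a true success (probability FPR). A minimal sketch of that map and its inverse is below; the function names are illustrative and not taken from the paper.

```python
def judge_space_rate(r_m: float, tpr: float, fpr: float) -> float:
    """Rate at which the judge flags items when the true failure rate is r_m."""
    # P(S_J = 1) = P(S_J = 1 | failure) * r_m + P(S_J = 1 | success) * (1 - r_m)
    return tpr * r_m + fpr * (1.0 - r_m)


def model_space_rate(r_j: float, tpr: float, fpr: float) -> float:
    """Invert the map above; only meaningful when the judge has TPR > FPR."""
    if tpr <= fpr:
        raise ValueError("judge must satisfy TPR > FPR")
    return (r_j - fpr) / (tpr - fpr)


# With TPR = 0.87 and FPR = 0.06, a 10% threshold becomes 14.1% in judge-space.
alpha_prime = judge_space_rate(0.10, tpr=0.87, fpr=0.06)
print(round(alpha_prime, 3))  # 0.141
```

Because the map is increasing in r_m whenever TPR > FPR, testing R_M < alpha is equivalent to testing that the judge's flag rate is below alpha'.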
When does this beat direct evaluation? The paper derives an exact condition: noisy testing outperforms direct human-only testing when (TPR - FPR)^2 exceeds a threshold that depends on the true failure rate and judge error rates. Intuitively, if your judge discriminates well between failures and successes (high TPR, low FPR), the large judge-labeled dataset more than compensates for the noise. This gives practitioners a concrete decision rule for whether to invest in judge-based evaluation.
Step-by-Step Workflow
1. Define the certification problem. Specify the failure criterion (e.g., "response is toxic"), the safety threshold alpha (e.g., 0.10 for 10% max failure rate), and the significance level zeta (e.g., 0.05 for 5% Type-I error).
2. Collect the calibration dataset D_M. Sample n_M items (50-200 recommended) and obtain BOTH human ground-truth labels (S_H in {0,1}) and judge labels (S_J in {0,1}) for each item. Ensure the sample contains both failures and successes; you need at least ~10 of each for stable TPR/FPR estimates.
3. Estimate TPR and FPR from the calibration set.
    # S_H = human labels, S_J = judge labels on D_M (NumPy arrays; 1 = failure, 0 = success)
    n_M1 = sum(S_H)  # human-labeled failures
    n_M0 = len(S_H) - n_M1  # human-labeled successes
    TPR_hat = sum(S_J[S_H == 1]) / n_M1  # judge recall on failures
    FPR_hat = sum(S_J[S_H == 0]) / n_M0  # judge false-alarm rate on successes
4. Check judge quality. Verify that TPR_hat > FPR_hat (the judge discriminates at all). Compute discriminability = TPR_hat - FPR_hat. If discriminability < 0.2, warn that the judge may not provide enough power to beat direct evaluation. Optionally evaluate the power condition (see Best Practices).
5. Transform the safety threshold into judge-space.
    alpha_prime_hat = FPR_hat + (TPR_hat - FPR_hat) * alpha
6. Collect the large judge-labeled dataset D_J. Run the judge on n_J items (1,000-10,000+). Only judge labels are needed; no human labels.
7. Compute the observed judge failure rate.
    R_J_hat = sum(S_J_large) / n_J
8. Compute the variance-corrected critical threshold.
    import math
    from scipy.stats import norm
    z = norm.ppf(zeta)  # e.g., -1.645 for zeta=0.05 (one-sided)
    var_judge = alpha_prime_hat * (1 - alpha_prime_hat) / n_J
    var_tpr = (alpha ** 2) * TPR_hat * (1 - TPR_hat) / n_M1
    var_fpr = ((1 - alpha) ** 2) * FPR_hat * (1 - FPR_hat) / n_M0
    c_prime_J = alpha_prime_hat + z * math.sqrt(var_judge + var_tpr + var_fpr)
   Note: z is negative for left-tail rejection, so c_prime_J < alpha_prime_hat.
9. Make the certification decision. If R_J_hat < c_prime_J, reject H0 and certify the model as meeting the safety threshold. Otherwise, fail to reject: the data do not support certification at this significance level.
10. Report diagnostics. Output: TPR_hat, FPR_hat, discriminability, alpha_prime_hat, R_J_hat, c_prime_J, the decision, and the estimated Oracle Gap (difference in power between using estimated vs. perfectly known judge parameters).
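The workflow above can be collected into a single function. The sketch below is one possible packaging, not the paper's reference implementation: the name certify and the returned dictionary keys are hypothetical, and it uses only the standard library (statistics.NormalDist in place of scipy.stats.norm) so it runs without extra dependencies.

```python
import math
from statistics import NormalDist


def certify(s_h, s_j_cal, s_j_large, alpha=0.10, zeta=0.05):
    """Two-stage noisy test. All labels are 0/1 sequences with 1 = failure."""
    if len(s_h) != len(s_j_cal):
        raise ValueError("calibration labels must be paired item by item")
    n_m1 = sum(s_h)              # human-labeled failures
    n_m0 = len(s_h) - n_m1       # human-labeled successes
    if n_m1 == 0 or n_m0 == 0:
        raise ValueError("calibration set needs both failures and successes")

    # Judge error rates estimated on the calibration set.
    tpr = sum(j for h, j in zip(s_h, s_j_cal) if h == 1) / n_m1
    fpr = sum(j for h, j in zip(s_h, s_j_cal) if h == 0) / n_m0
    if tpr <= fpr:
        raise ValueError(f"judge does not discriminate: TPR={tpr:.3f}, FPR={fpr:.3f}")

    # Judge-space threshold and observed judge failure rate.
    alpha_prime = fpr + (tpr - fpr) * alpha
    n_j = len(s_j_large)
    r_j = sum(s_j_large) / n_j

    # Three variance sources: judge sampling, TPR estimation, FPR estimation.
    var_total = (
        alpha_prime * (1 - alpha_prime) / n_j
        + alpha**2 * tpr * (1 - tpr) / n_m1
        + (1 - alpha) ** 2 * fpr * (1 - fpr) / n_m0
    )
    z = NormalDist().inv_cdf(zeta)  # negative for zeta < 0.5
    c_prime = alpha_prime + z * math.sqrt(max(var_total, 0.0))
    return {
        "tpr": tpr, "fpr": fpr, "discriminability": tpr - fpr,
        "alpha_prime": alpha_prime, "r_j": r_j, "c_prime": c_prime,
        "certified": r_j < c_prime,
    }


# Synthetic counts: 50 failures (45 flagged), 150 successes (6 flagged),
# and 5,000 judge-only items of which 400 are flagged.
s_h = [1] * 50 + [0] * 150
s_j_cal = [1] * 45 + [0] * 5 + [1] * 6 + [0] * 144
s_j_large = [1] * 400 + [0] * 4600
result = certify(s_h, s_j_cal, s_j_large)
print(result["certified"])  # True for these synthetic counts
```

Returning every diagnostic alongside the decision keeps step 10 (reporting) a matter of printing the dictionary.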
Concrete Examples
Example 1: Toxicity Certification of a Chatbot
User: "I have 100 human-labeled chat responses and 5,000 GPT-4-judged responses. Can I certify that my chatbot's toxicity rate is below 10%?"
Approach:
- Set alpha = 0.10, zeta = 0.05
- From the 100 human-labeled samples, compute TPR and FPR of the GPT-4 judge
- Transform threshold and run the variance-corrected test on the 5,000 samples
import numpy as np
from scipy.stats import norm
import math
# Calibration data (100 samples with both human and judge labels)
S_H = np.array([...]) # human labels: 1=toxic, 0=safe
S_J_cal = np.array([...]) # judge labels on calibration set
# Large judge-only data (5000 samples)
S_J_large = np.array([...]) # judge labels
alpha = 0.10
zeta = 0.05
# Step 3: Estimate TPR/FPR
n_M1 = S_H.sum()
n_M0 = len(S_H) - n_M1
TPR_hat = S_J_cal[S_H == 1].sum() / n_M1
FPR_hat = S_J_cal[S_H == 0].sum() / n_M0
# Step 4: Check judge quality
print(f"TPR: {TPR_hat:.3f}, FPR: {FPR_hat:.3f}, Discriminability: {TPR_hat - FPR_hat:.3f}")
# Step 5: Transform threshold
alpha_prime = FPR_hat + (TPR_hat - FPR_hat) * alpha
# Step 7: Observed judge failure rate
n_J = len(S_J_large)
R_J_hat = S_J_large.mean()
# Step 8: Variance-corrected critical threshold
z = norm.ppf(zeta) # -1.645
var_total = (
alpha_prime * (1 - alpha_prime) / n_J
+ alpha**2 * TPR_hat * (1 - TPR_hat) / n_M1
+ (1 - alpha)**2 * FPR_hat * (1 - FPR_hat) / n_M0
)
c_prime_J = alpha_prime + z * math.sqrt(var_total)
# Step 9: Decision
certified = R_J_hat < c_prime_J
print(f"Observed judge rate: {R_J_hat:.4f}")
print(f"Critical threshold: {c_prime_J:.4f}")
print(f"Certified safe: {certified}")
Example output (illustrative; exact values depend on the calibration and judge data):
TPR: 0.870, FPR: 0.060, Discriminability: 0.810
Observed judge rate: 0.1120
Critical threshold: 0.1194
Certified safe: True
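One way to build confidence in the Type-I error claim is a small Monte Carlo check on synthetic data. The sketch below assumes a judge with constant TPR and FPR, a calibration set with a fixed number of failures and successes, and a model sitting exactly on the boundary R_M = alpha, where H0 is true. The function name and all parameter values are illustrative choices, not settings from the paper.

```python
import math
import random
from statistics import NormalDist


def simulate_rejection_rate(r_m, tpr, fpr, alpha, zeta,
                            n_m1, n_m0, n_j, trials, seed=0):
    """Fraction of simulated experiments in which the noisy test certifies."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(zeta)
    p_flag = fpr + (tpr - fpr) * r_m  # true judge flag rate on D_J
    rejections = 0
    for _ in range(trials):
        # Stage 1: estimate judge error rates from simulated calibration labels.
        tpr_hat = sum(rng.random() < tpr for _ in range(n_m1)) / n_m1
        fpr_hat = sum(rng.random() < fpr for _ in range(n_m0)) / n_m0
        # Stage 2: observed judge failure rate on the large judge-only set.
        r_j_hat = sum(rng.random() < p_flag for _ in range(n_j)) / n_j
        alpha_prime = fpr_hat + (tpr_hat - fpr_hat) * alpha
        var_total = (
            alpha_prime * (1 - alpha_prime) / n_j
            + alpha**2 * tpr_hat * (1 - tpr_hat) / n_m1
            + (1 - alpha) ** 2 * fpr_hat * (1 - fpr_hat) / n_m0
        )
        c_prime = alpha_prime + z * math.sqrt(var_total)
        # Certify only if the judge discriminates and the rate clears the bar.
        if tpr_hat > fpr_hat and r_j_hat < c_prime:
            rejections += 1
    return rejections / trials


common = dict(tpr=0.85, fpr=0.10, alpha=0.10, zeta=0.05,
              n_m1=100, n_m0=100, n_j=2000, trials=500)
boundary_rate = simulate_rejection_rate(r_m=0.10, **common)  # H0 true
safe_rate = simulate_rejection_rate(r_m=0.02, **common)      # H1 true
print(f"false certification rate at the boundary: {boundary_rate:.3f}")
print(f"certification rate for a safe model:      {safe_rate:.3f}")
```

The first number estimates the Type-I error and should sit near or below zeta up to simulation noise; the second estimates power for a model well inside the safe region.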
Example 2: Evaluating Whether a Judge Is Worth Using
User: "My LLM judge has TPR=0.75 and FPR=0.15. Should I use the noisy testing framework, or just label 200 samples by hand?"
Approach:
- Apply the power superiority condition from the paper
- Compare statistical power of noisy test (n_M=50 calibration + n_J=5000 judge) vs. direct test (n=200 human)
# Power condition: (TPR - FPR)^2 > threshold
TPR, FPR = 0.75, 0.15
alpha = 0.10
R_M_assumed = 0.08 # assumed true failure rate (under H1)
discriminability_sq = (TPR - FPR) ** 2 # 0.36
threshold = (
alpha**2 * TPR * (1 - TPR) / R_M_assumed
+ (1 - alpha)**2 * FPR * (1 - FPR) / (1 - R_M_assumed)
) / (R_M_assumed * (1 - R_M_assumed))
print(f"(TPR-FPR)^2 = {discriminability_sq:.4f}")
print(f"Required threshold = {threshold:.4f}")
if discriminability_sq > threshold:
print("Verdict: Noisy testing has higher power. Use the judge.")
else:
print("Verdict: Direct human evaluation has higher power. Label by hand.")
Output:
(TPR-FPR)^2 = 0.3600
Required threshold = 1.8437
Verdict: Direct human evaluation has higher power. Label by hand.
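The comparison described in the approach (n_M=50 calibration plus n_J=5000 judge labels versus n=200 human labels) can also be approximated directly from the sample sizes. The sketch below uses a plain normal approximation to the power of each one-sided test. It is not the paper's Theorem 5.4, and the calibration split of 4 failures and 46 successes is an assumption (the expected split of 50 items at an 8% failure rate).

```python
import math
from statistics import NormalDist

_norm = NormalDist()


def direct_power(r_m, alpha, zeta, n):
    """Approximate power of the one-sided test on n human labels."""
    z = _norm.inv_cdf(zeta)
    critical = alpha + z * math.sqrt(alpha * (1 - alpha) / n)
    return _norm.cdf((critical - r_m) / math.sqrt(r_m * (1 - r_m) / n))


def noisy_power(r_m, alpha, zeta, tpr, fpr, n_m1, n_m0, n_j):
    """Approximate power of the variance-corrected test on judge labels."""
    z = _norm.inv_cdf(zeta)
    # Calibration variance is the same under H0 and H1.
    calib_var = (alpha**2 * tpr * (1 - tpr) / n_m1
                 + (1 - alpha) ** 2 * fpr * (1 - fpr) / n_m0)
    alpha_prime = fpr + (tpr - fpr) * alpha   # judge-space threshold
    r_j = fpr + (tpr - fpr) * r_m             # judge flag rate at r_m
    sd_null = math.sqrt(alpha_prime * (1 - alpha_prime) / n_j + calib_var)
    sd_alt = math.sqrt(r_j * (1 - r_j) / n_j + calib_var)
    # R_J_hat - alpha_prime_hat has mean (tpr - fpr) * (r_m - alpha) under H1.
    shift = (tpr - fpr) * (alpha - r_m)
    return _norm.cdf((z * sd_null + shift) / sd_alt)


power_direct = direct_power(0.08, alpha=0.10, zeta=0.05, n=200)
power_noisy = noisy_power(0.08, alpha=0.10, zeta=0.05, tpr=0.75, fpr=0.15,
                          n_m1=4, n_m0=46, n_j=5000)
print(f"direct test power: {power_direct:.3f}")
print(f"noisy test power:  {power_noisy:.3f}")
```

Under these assumptions the direct test comes out ahead, mainly because four calibration failures and 46 successes leave the TPR and FPR estimates too noisy; rerunning with a larger calibration set shows how quickly that changes.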
Example 3: Diagnosing the Oracle Gap
User: "How much statistical power am I losing because I estimated TPR/FPR from only 80 calibration samples?"
Approach:
- Compute the critical threshold with estimated parameters (practical case)
- Compute the critical threshold assuming TPR/FPR are known exactly (oracle case)
- The gap in thresholds shows the power cost of estimation uncertainty
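# Reuses alpha, zeta, alpha_prime, TPR_hat, FPR_hat, n_M1, n_M0, n_J, norm, and math from Example 1
# Practical threshold (includes calibration variance)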
var_practical = (
alpha_prime * (1 - alpha_prime) / n_J
+ alpha**2 * TPR_hat * (1 - TPR_hat) / n_M1
+ (1 - alpha)**2 * FPR_hat * (1 - FPR_hat) / n_M0
)
# Oracle threshold (no calibration variance — only judge sampling variance)
var_oracle = alpha_prime * (1 - alpha_prime) / n_J
z = norm.ppf(zeta)
c_practical = alpha_prime + z * math.sqrt(var_practical)
c_oracle = alpha_prime + z * math.sqrt(var_oracle)
oracle_gap = c_practical - c_oracle
print(f"Practical threshold: {c_practical:.4f}")
print(f"Oracle threshold: {c_oracle:.4f}")
print(f"Oracle gap: {oracle_gap:.4f}")
print(f"Quadrupling the calibration set to ~{4 * (n_M1 + n_M0)} samples at least halves the gap")
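To see how the gap responds to calibration size, the same computation can be wrapped in a function and evaluated at several sizes. This is a sketch with illustrative inputs (TPR 0.87, FPR 0.06, an 80-item calibration set split 20/60); it holds the rate estimates fixed while scaling the counts, which is an idealization.

```python
import math
from statistics import NormalDist


def oracle_gap(alpha, zeta, tpr, fpr, n_m1, n_m0, n_j):
    """Practical minus oracle critical threshold (zero or negative)."""
    z = NormalDist().inv_cdf(zeta)
    alpha_prime = fpr + (tpr - fpr) * alpha
    var_oracle = alpha_prime * (1 - alpha_prime) / n_j   # judge sampling only
    var_calib = (alpha**2 * tpr * (1 - tpr) / n_m1
                 + (1 - alpha) ** 2 * fpr * (1 - fpr) / n_m0)
    c_practical = alpha_prime + z * math.sqrt(var_oracle + var_calib)
    c_oracle = alpha_prime + z * math.sqrt(var_oracle)
    return c_practical - c_oracle


# Scale the calibration set by 1x, 4x, and 16x with the same failure share.
for scale in (1, 4, 16):
    gap = oracle_gap(0.10, 0.05, 0.87, 0.06, 20 * scale, 60 * scale, 5000)
    print(f"calibration size {80 * scale:5d}: gap = {gap:.4f}")
```

Each quadrupling shrinks the gap by a factor between 2 and 4: close to 2 while calibration variance dominates, and close to 4 once judge sampling variance takes over.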
Best Practices
- Do: Ensure your calibration set has sufficient failures AND successes. If the base failure rate is low (e.g., 5%), you need a larger calibration set to get enough failure examples for stable TPR estimation. Aim for at least 10 positive and 10 negative examples.
- Do: Report TPR, FPR, and discriminability alongside certification results. These diagnostics are a key advantage over black-box methods like PPI — they tell the user why certification succeeded or failed.
- Do: Use the power superiority condition before committing to the framework. If your judge is poor (discriminability < 0.3), you may get more power from direct human evaluation on a modest sample.
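- Do: Increase n_J (judge-labeled set) when the judge sampling variance (the first variance term) dominates; more judge data is cheap and shrinks that term. When the Oracle Gap analysis shows calibration variance dominates, extra judge data will not help: add human-labeled calibration samples instead.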
- Avoid: Using the same data for calibration and testing. The calibration set D_M and judge-labeled set D_J must be disjoint for the Type-I error guarantee to hold.
- Avoid: Treating the framework as a way to avoid human labels entirely. You always need some human-labeled data for calibration. The method reduces — not eliminates — the human labeling burden.
- Avoid: Applying this when the judge and the model under test are the same LLM, or when the judge's errors are correlated with the model's failures in ways not captured by a constant TPR/FPR model.
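The first practice above (at least 10 failures and 10 successes in the calibration set) translates into a sample-size question when calibration items are drawn at random. The sketch below finds the smallest calibration size that reaches both counts with a chosen probability under a binomial model. The function names, the 95% default, and the independence assumption are illustrative and not part of the paper.

```python
import math


def prob_fewer_than(n, p, k):
    """P(X < k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(min(k, n + 1))
    )


def prob_enough_of_each(n, base_rate, min_count=10):
    """P(at least min_count failures and min_count successes among n draws)."""
    if n < 2 * min_count:
        return 0.0
    # For n >= 2 * min_count the two shortfall events cannot happen together.
    too_few_failures = prob_fewer_than(n, base_rate, min_count)
    too_few_successes = prob_fewer_than(n, 1 - base_rate, min_count)
    return 1.0 - too_few_failures - too_few_successes


def min_calibration_size(base_rate, min_count=10, confidence=0.95, n_max=20000):
    """Smallest n meeting the count requirement with the given probability."""
    for n in range(2 * min_count, n_max + 1):
        if prob_enough_of_each(n, base_rate, min_count) >= confidence:
            return n
    raise ValueError("no n <= n_max reaches the requested confidence")


n_needed = min_calibration_size(base_rate=0.05)
print(f"calibration items needed at a 5% base rate: {n_needed}")
```

At a 5% base rate, 200 random items yield 10 failures only on average, so the required size is noticeably larger than 200. Oversampling suspected failures for calibration is an alternative, since TPR and FPR are estimated separately within each class.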
Error Handling
- TPR_hat <= FPR_hat: The judge cannot discriminate failures from successes. Abort and either improve the judge or fall back to direct human evaluation. Report this to the user with the estimated values.
- n_M1 or n_M0 is 0: Cannot estimate TPR or FPR. The calibration set lacks failures or successes. Collect more diverse calibration data.
- n_M1 or n_M0 < 10: TPR/FPR estimates will be highly unstable. Warn the user that the Oracle Gap will be large and recommend increasing the calibration set.
- Variance term is negative under square root: This should not happen mathematically (all terms are non-negative). If it occurs due to floating point, clip to zero and flag a numerical issue.
- R_J_hat is far from alpha_prime_hat: If the observed rate is much lower, certification is easy regardless of method. If much higher, no reasonable test will certify the model — the model likely fails the threshold.
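The checks above can be enforced before any threshold is computed. A minimal sketch follows; the function names, messages, and the choice of ValueError versus warnings are assumptions made for illustration.

```python
import math
import warnings


def estimate_rates(flagged_fail, n_m1, flagged_succ, n_m0, min_count=10):
    """Return (TPR_hat, FPR_hat) from calibration counts, or raise/warn."""
    if n_m1 == 0 or n_m0 == 0:
        raise ValueError(
            "calibration set lacks failures or successes; "
            "collect more diverse calibration data"
        )
    if n_m1 < min_count or n_m0 < min_count:
        warnings.warn(
            f"only {n_m1} failures and {n_m0} successes in calibration; "
            "TPR/FPR estimates will be unstable and the Oracle Gap large"
        )
    tpr_hat = flagged_fail / n_m1
    fpr_hat = flagged_succ / n_m0
    if tpr_hat <= fpr_hat:
        raise ValueError(
            f"judge cannot discriminate (TPR={tpr_hat:.3f}, FPR={fpr_hat:.3f}); "
            "improve the judge or fall back to direct human evaluation"
        )
    return tpr_hat, fpr_hat


def safe_sqrt(variance):
    """Square root that clips tiny negative values and flags them."""
    if variance < 0:
        warnings.warn(f"negative variance {variance!r} clipped to zero")
        variance = 0.0
    return math.sqrt(variance)


print(estimate_rates(flagged_fail=18, n_m1=20, flagged_succ=3, n_m0=60))
```

Raising on the two fatal conditions and warning on the small-sample condition mirrors the distinction between "abort" and "warn the user" in the list above.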
Limitations
- Constant TPR/FPR assumption. The framework assumes the judge's error rates are constant across all inputs. If the judge is systematically worse on certain subpopulations (e.g., more biased on politically charged content), the guarantees weaken. Consider stratified calibration for heterogeneous domains.
- Binary labels only. The method as presented handles binary pass/fail judgments. Extending to ordinal or continuous quality scores requires a different formulation.
- Calibration set must be representative. If the calibration distribution differs from the test distribution, TPR/FPR estimates will be biased, invalidating the guarantees.
- Not a replacement for understanding failure modes. Certification tells you the rate is below a threshold; it does not explain what failures remain or how to fix them.
- The Oracle Gap cannot be eliminated at finite sample sizes. It shrinks only as the calibration set grows; with small calibration sets (< 50), the gap between practical and oracle power can be substantial, potentially negating the benefit of using a judge.
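For the constant TPR/FPR assumption, a quick diagnostic is to estimate the judge's rates separately within each subpopulation of the calibration set and look for large differences. The sketch below assumes each calibration record carries a stratum label; the record layout and function name are hypothetical.

```python
from collections import defaultdict


def rates_by_stratum(records):
    """Per-stratum (TPR_hat, FPR_hat) from (stratum, human, judge) triples.

    Labels are 0/1 with 1 = failure. A rate is None when the stratum has
    no items of the class needed to estimate it.
    """
    counts = defaultdict(lambda: {"tp": 0, "pos": 0, "fp": 0, "neg": 0})
    for stratum, s_h, s_j in records:
        c = counts[stratum]
        if s_h == 1:
            c["pos"] += 1
            c["tp"] += s_j   # judge flagged a true failure
        else:
            c["neg"] += 1
            c["fp"] += s_j   # judge flagged a true success
    rates = {}
    for stratum, c in counts.items():
        tpr = c["tp"] / c["pos"] if c["pos"] else None
        fpr = c["fp"] / c["neg"] if c["neg"] else None
        rates[stratum] = (tpr, fpr)
    return rates


records = [
    ("general", 1, 1), ("general", 1, 0), ("general", 0, 0), ("general", 0, 0),
    ("political", 1, 1), ("political", 0, 1), ("political", 0, 0),
]
for stratum, (tpr, fpr) in rates_by_stratum(records).items():
    print(stratum, tpr, fpr)
```

With realistic stratum sizes, a large spread in these per-stratum rates suggests the pooled TPR/FPR model is a poor fit and that calibration and testing may need to be run per stratum.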
Reference
Feng, Shen, Balashankar, Gerner-Beuerle, Rodrigues. "Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges." ICLR 2026. arXiv:2601.20913. Focus on: Algorithm 1 (two-stage procedure), Theorem 5.1 (Type-I error control), Theorem 5.4 (power superiority condition), and Section 6 (experiments on Jigsaw/HateSpeech/SafeRLHF).