
name: "benchmarking-uncertainty-calibration-long-form"
description: |
  Implement uncertainty quantification and calibration assessment for LLM-generated long-form answers. Apply answer-frequency consistency, verbalized confidence elicitation, token-level analysis, and multi-metric calibration benchmarking based on the UQ framework from Müller et al. (2026).
  Trigger phrases:
  - "measure how confident the model is in this answer"
  - "calibrate uncertainty on these QA results"
  - "benchmark uncertainty quantification for my LLM pipeline"
  - "which uncertainty method should I use for scientific QA"
  - "detect unreliable LLM answers"
  - "evaluate calibration of model confidence scores"

Benchmarking Uncertainty Calibration for LLM Long-Form QA

This skill enables Claude to design, implement, and evaluate uncertainty quantification (UQ) pipelines for LLM-generated long-form answers. It applies the findings from Müller et al. (2026), which benchmarked UQ methods across 20 LLMs and 685,000 responses on scientific QA tasks. The core actionable insight: answer frequency (consistency across multiple sampled generations) yields the most reliable calibration, while verbalized confidence is systematically biased and token-level probabilities are degraded by instruction tuning. This skill teaches how to build systems that surface trustworthy uncertainty estimates and avoid common calibration measurement pitfalls.

When to Use

When the user needs to flag unreliable LLM answers in a QA pipeline (e.g., scientific literature review, medical QA, legal research)
When building a system that must decide whether to surface an LLM answer or escalate to a human reviewer
When the user asks which UQ method to use for instruction-tuned or reasoning models
When evaluating whether an LLM's self-reported confidence scores are trustworthy
When implementing selective prediction (answer only when confident, abstain otherwise)
When benchmarking multiple LLMs and needing to compare their calibration quality
When the user wants to audit an existing confidence-scoring system for hidden biases

Key Technique

The problem. LLMs produce answers with no built-in reliability signal. Practitioners need uncertainty estimates to decide when to trust model output. Four main approaches exist: (1) token-level probabilities (the softmax confidence on generated tokens), (2) verbalized confidence (asking the model "how confident are you?"), (3) answer frequency (sampling N responses and measuring consistency), and (4) claim-conditioned probability (CCP, computing entailment-vs-contradiction token ratios). The paper finds these methods are not equally reliable, and the standard way of measuring them (ECE alone) is misleading.

What works. Answer frequency — generating 10+ responses to the same prompt and computing the proportion of semantically equivalent answers — provides the best-calibrated uncertainty signal. It is robust to the probability mass polarization that instruction tuning induces (where models collapse nearly all softmax mass onto a single token, destroying the information in probability distributions). Verbalized confidence, by contrast, is systematically overconfident and poorly correlated with correctness. Token-level methods (including P(True) and CCP) degrade as models become more heavily fine-tuned.

How to measure calibration correctly. Expected Calibration Error (ECE) alone is insufficient — it collapses when confidence scores cluster in a narrow range, making poorly calibrated models appear well-calibrated. Always pair ECE with AUROC (discrimination ability), Brier score (proper scoring rule), and visual calibration plots. Evaluate on domain-specific data: factual retrieval tasks show different calibration profiles than multi-step reasoning tasks like GSM8K or GPQA.
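The ECE pitfall described above can be reproduced on toy data. The following is a minimal sketch using only the standard library; the function names `ece` and `auroc` and the ten synthetic items are illustrative assumptions, not part of the benchmark. A scorer that assigns every answer the base accuracy rate gets a near-perfect ECE but a chance-level AUROC, while a scorer that separates correct from incorrect answers gets a worse ECE and a perfect AUROC.

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width bins; the last bin includes 1.0."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    err = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        acc = sum(y for _, y in items) / len(items)
        err += abs(acc - mean_conf) * len(items) / total
    return err


def auroc(confidences, correct):
    """Probability that a random correct item outscores a random incorrect one (ties count 0.5)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data (assumption): 70% accuracy, every answer given confidence 0.7.
correct = [1] * 7 + [0] * 3
flat_conf = [0.7] * 10
# A second synthetic scorer that ranks every correct answer above every incorrect one.
sharp_conf = [0.9] * 7 + [0.2] * 3

flat_ece, flat_auroc = ece(flat_conf, correct), auroc(flat_conf, correct)
sharp_ece, sharp_auroc = ece(sharp_conf, correct), auroc(sharp_conf, correct)
print(f"flat : ECE={flat_ece:.3f} AUROC={flat_auroc:.3f}")
print(f"sharp: ECE={sharp_ece:.3f} AUROC={sharp_auroc:.3f}")
```

Ranked by ECE alone, the flat scorer would win even though it cannot separate a single correct answer from an incorrect one, which is why the metrics are reported together.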

Step-by-Step Workflow

Define the QA task and correctness criterion. Determine whether answers are multiple-choice (compare to ground-truth label), arithmetic (extract and compare numerical result), or open-ended (require NLI-based semantic matching). This choice determines how you compute the binary correctness signal needed for calibration.
Subsample and structure the evaluation set. Select 200-500 questions from your dataset. For each question, prepare a prompt template that elicits long-form reasoning (use Chain-of-Thought or APriCoT-style counterfactual prompting for MCQA).
Generate multiple responses per question. For each question, sample N=10 completions at temperature 0.7-1.0. Store each response with its full token-level log-probabilities if the API exposes them (OpenAI, Mistral, and vLLM-served models do). This gives you the raw material for all UQ methods.
Compute answer-frequency uncertainty. For each question, cluster the N responses by semantic equivalence. For MCQA, extract the selected option letter. For arithmetic QA, extract the final numerical answer. For open-ended QA, use an NLI model (e.g., DeBERTa-v3-large-mnli) to group responses that mutually entail each other. The size of the largest cluster divided by N is the confidence score: confidence = count_of_most_common_cluster / N.
Compute verbalized confidence (for comparison). After generating the answer, issue a follow-up prompt: "On a scale from 0.0 to 1.0, what is the probability that your answer above is correct? Respond with only a decimal number." Parse the returned number. Note: this method is included for benchmarking, not as a recommended production signal.
Compute token-level confidence (if logprobs available). For MCQA, use the P(True) approach: prompt the model with its own answer and ask it to classify the answer as "(A) True" or "(B) False", then extract the softmax probability assigned to the option token that stands for "True". For arithmetic, take the mean log-probability of the tokens in the final numerical answer and exponentiate it to get a score in (0, 1] that can be binned like the other confidences.
Score correctness. Compare each response to the ground-truth answer and store the result as a binary array y_correct with one entry per scored item. For verbalized and token-level confidence, the scored item is the individual response. For answer frequency, the scored item is the question, and correctness is whether the majority-cluster answer matches ground truth.
Compute calibration metrics — never ECE alone. Bin confidence scores into 10-15 equal-width bins. Compute:
- ECE: weighted average of |accuracy_in_bin - mean_confidence_in_bin| across bins
- Brier score: mean of (confidence - correctness)^2 across all items
- AUROC: treat correctness as the label and confidence as the score; compute area under ROC
- Calibration plot: plot bin-level accuracy vs. bin-level mean confidence; the diagonal is perfect calibration
Diagnose failure modes. Check for probability mass polarization: if >90% of token-level confidence scores fall in the [0.95, 1.0] bin, token-level methods are unreliable for this model. Check for verbalized overconfidence: if mean verbalized confidence exceeds accuracy by >15 percentage points, the model is systematically overconfident. Check for ECE-accuracy coupling: if ECE is low but AUROC is also low (~0.5), the ECE is misleadingly optimistic.
Select the best UQ method for your deployment. Rank methods by AUROC (discrimination) first, then by Brier score (calibration + sharpness). Use the winning method as the production uncertainty signal for selective prediction or human-escalation thresholds.
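For structured answers, the answer-frequency and correctness steps above can be sketched with exact matching instead of NLI. This is a minimal illustration under one loud assumption: the prompt template instructs the model to finish with a line of the form "Answer: <letter>". The pattern `ANSWER_PATTERN` and the helper names `extract_option` and `answer_frequency` are hypothetical.

```python
import re
from collections import Counter
from typing import Optional

# Assumption: the prompt asks the model to end with "Answer: <letter>" (options A-D).
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-D])\)?")


def extract_option(response: str) -> Optional[str]:
    """Return the last option letter announced with 'Answer:', or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def answer_frequency(responses, ground_truth):
    """Return (majority_answer, confidence, is_correct) for one question."""
    answers = [extract_option(r) for r in responses]
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0.0, False
    majority, count = counts.most_common(1)[0]
    # Unparseable responses stay in the denominator, so they lower the confidence.
    confidence = count / len(responses)
    return majority, confidence, majority == ground_truth


samples = [
    "Mitochondria produce ATP. Answer: B",
    "The organelle is the mitochondrion. Answer: (B)",
    "Ribosomes make proteins, so... Answer: C",
    "I am not sure.",
    "Answer: B",
]
majority, confidence, is_correct = answer_frequency(samples, ground_truth="B")
print(majority, confidence, is_correct)
```

Keeping unparseable responses in the denominator is a design choice: a model that often fails to commit to an answer is treated as less confident rather than having those samples silently dropped.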

Concrete Examples

Example 1: Building a confidence filter for a science tutoring chatbot

User: "I'm building a science QA chatbot. I want to only show answers the model is confident about and route uncertain ones to human tutors. How should I measure confidence?"

Approach:

For each student question, sample 10 responses from the LLM at temperature 0.8.
Extract the core answer from each response (the specific claim or value).
Cluster responses by semantic equivalence using string matching for factual answers or an NLI model for explanatory answers.
Compute answer frequency: confidence = size_of_largest_cluster / 10.
Set a threshold (e.g., 0.7): if confidence >= 0.7, show the majority answer; otherwise, route to a human tutor.
Validate the threshold on a held-out set of 200 questions with known answers, checking that accuracy among shown answers exceeds your target (e.g., 90%).

Output:

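from transformers import pipeline

# Checkpoint ID as given in this skill; confirm it and its label names on the model hub before use.
nli = pipeline("text-classification", model="microsoft/deberta-v3-large-mnli")

def entails(premise: str, hypothesis: str, min_score: float = 0.7) -> bool:
    """True if the NLI model predicts that premise entails hypothesis."""
    # Pass the pair as text/text_pair so the tokenizer inserts its own separator token.
    top = nli({"text": premise, "text_pair": hypothesis}, top_k=1)[0]
    return top["label"].upper() == "ENTAILMENT" and top["score"] > min_score

def compute_answer_frequency(responses: list[str]) -> tuple[str, float]:
    """Cluster responses by mutual entailment, return (best_answer, confidence)."""
    clusters: list[list[str]] = []
    for resp in responses:
        for cluster in clusters:
            # Same cluster only if the response and the cluster's first member entail each other.
            if entails(cluster[0], resp) and entails(resp, cluster[0]):
                cluster.append(resp)
                break
        else:
            clusters.append([resp])
    largest = max(clusters, key=len)
    # Divide by the number of responses actually received, not an assumed sample count.
    return largest[0], len(largest) / len(responses)

# Usage in pipeline (llm, question, show_to_user, and escalate_to_human are placeholders for your own code)
responses = [llm.generate(question, temperature=0.8) for _ in range(10)]
best_answer, confidence = compute_answer_frequency(responses)
if confidence >= 0.7:
    show_to_user(best_answer, confidence)
else:
    escalate_to_human(question)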
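The threshold-validation step in the approach above could be sketched as follows. This is an illustrative outline only: the helper names `selective_report` and `pick_threshold` and the ten held-out items are made up for the example, and a real validation set would use the 200 labelled questions mentioned in the approach.

```python
def selective_report(confidences, correct, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """For each threshold, return (threshold, coverage, accuracy among shown answers)."""
    rows = []
    n = len(confidences)
    for t in thresholds:
        shown = [y for c, y in zip(confidences, correct) if c >= t]
        coverage = len(shown) / n
        # Accuracy is undefined when no answer clears the threshold.
        accuracy = sum(shown) / len(shown) if shown else None
        rows.append((t, coverage, accuracy))
    return rows


def pick_threshold(rows, target_accuracy=0.9):
    """Return the first threshold (rows are in ascending order) that meets the target, or None."""
    for t, _coverage, accuracy in rows:
        if accuracy is not None and accuracy >= target_accuracy:
            return t
    return None


# Toy held-out set (assumption): answer-frequency confidence and majority-answer correctness.
held_out_conf = [1.0, 1.0, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
held_out_correct = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

rows = selective_report(held_out_conf, held_out_correct)
for t, coverage, accuracy in rows:
    print(f"threshold={t:.1f} coverage={coverage:.2f} accuracy={accuracy}")
print("chosen threshold:", pick_threshold(rows))
```

Reporting coverage next to accuracy makes the trade-off explicit: a stricter threshold raises accuracy among shown answers but routes more questions to human tutors.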

Example 2: Auditing verbalized confidence for systematic bias

User: "Our LLM pipeline asks the model to rate its own confidence 0-1. Is that reliable?"

Approach:

Collect 300+ question-answer pairs where ground-truth correctness is known.
For each, record the model's verbalized confidence and whether the answer is correct.
Bin confidences into 10 equal-width bins ([0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]), with the last bin closed so that a confidence of exactly 1.0 is counted.
Plot accuracy per bin vs. mean confidence per bin (calibration plot).
Compute ECE, Brier score, and AUROC.
Compare against answer-frequency confidence from 10 samples per question.

Output:

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def calibration_report(confidences, correctness, n_bins: int = 10):
    """Compute ECE, Brier, AUROC and print calibration diagnostics."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=int)
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    print(f"{'Bin':>12} {'Count':>6} {'Acc':>6} {'Conf':>6} {'|Gap|':>6}")
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # Bins are half-open [lo, hi); the last bin is closed so that confidence 1.0 is counted.
        upper = confidences <= hi if i == n_bins - 1 else confidences < hi
        mask = (confidences >= lo) & upper
        if mask.sum() == 0:
            continue
        bin_acc = correctness[mask].mean()
        bin_conf = confidences[mask].mean()
        gap = abs(bin_acc - bin_conf)
        ece += gap * mask.sum()
        print(f"  [{lo:.1f},{hi:.1f}) {mask.sum():>6} {bin_acc:>6.3f} {bin_conf:>6.3f} {gap:>6.3f}")
    ece /= len(confidences)
    brier = brier_score_loss(correctness, confidences)
    # AUROC is undefined when every answer is correct or every answer is wrong.
    has_both_classes = len(np.unique(correctness)) == 2
    auroc = roc_auc_score(correctness, confidences) if has_both_classes else float("nan")
    print(f"\nECE:   {ece:.4f}")
    print(f"Brier: {brier:.4f}")
    print(f"AUROC: {auroc:.4f}")
    if auroc < 0.55:
        print("WARNING: AUROC near chance: confidence scores have no discriminative power.")
    if confidences.mean() - correctness.mean() > 0.15:
        print("WARNING: Systematic overconfidence detected (mean conf >> mean accuracy).")
    return {"ece": ece, "brier": brier, "auroc": auroc}

# Compare verbalized vs answer-frequency.
# verbalized_confs, frequency_confs, and correct_labels are assumed to be aligned arrays collected beforehand.
verb_report = calibration_report(verbalized_confs, correct_labels)
freq_report = calibration_report(frequency_confs, correct_labels)

Example 3: Detecting probability mass polarization in a fine-tuned model

User: "I fine-tuned Llama-3 for chemistry QA. Can I trust the logprob-based confidence?"

Approach:

Generate 500 answers with logprobs enabled.
For each answer, extract the max token probability for the answer tokens.
Plot the distribution of these max-token probabilities.
If >85% of values exceed 0.95, probability mass polarization has occurred.
Fall back to answer-frequency as the uncertainty method.

Output:

def detect_polarization(max_token_probs: list[float], threshold: float = 0.95, max_fraction: float = 0.85) -> bool:
    """Check if token-level probs are polarized (unreliable for UQ)."""
    if not max_token_probs:
        raise ValueError("max_token_probs is empty; generate answers with logprobs enabled first.")
    # Share of answers whose max token probability exceeds the threshold.
    fraction_above = sum(1 for p in max_token_probs if p > threshold) / len(max_token_probs)
    print(f"Fraction of max-token probs > {threshold}: {fraction_above:.2%}")
    if fraction_above > max_fraction:
        print("POLARIZED: Token-level confidence is unreliable for this model.")
        print("Recommendation: Use answer-frequency (sample N=10) instead.")
        return True
    print("Token-level confidence may be usable. Validate with calibration metrics.")
    return False

Best Practices

Do: Always sample at least 10 responses per question when computing answer frequency. Fewer samples produce noisy estimates; 10 is the empirically validated sweet spot from the benchmark.
Do: Report ECE, Brier score, AUROC, and a calibration plot together. Any single metric can be misleading in isolation — especially ECE when confidence scores are clustered.
Do: Use temperature 0.7-1.0 for sampling diversity. At temperature 0 (greedy decoding) the samples are identical or nearly so, which pushes answer frequency to 1.0 for every question and makes it meaningless.
Do: Test calibration on your specific domain. Factual retrieval tasks calibrate differently than multi-step reasoning tasks (GSM8K-style arithmetic shows substantially higher ECE).
Avoid: Trusting verbalized confidence ("I am 90% sure") as a production signal. It is systematically overconfident and poorly correlated with correctness across all tested models.
Avoid: Using only token-level logprobs from instruction-tuned or RLHF'd models. Probability mass polarization makes these scores near-binary and uninformative.
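The first practice above, that fewer samples give noisier frequency estimates, can be made concrete with the binomial standard error. This is a rough sketch that assumes each sample independently agrees with the majority answer with probability p, which is an approximation; the function name `frequency_standard_error` and the value p = 0.7 are illustrative.

```python
import math


def frequency_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an agreement rate p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# Assumed true agreement rate of 0.7; compare the noise at different sample counts.
for n in (3, 5, 10, 20):
    print(f"N={n:>2}  standard error={frequency_standard_error(0.7, n):.3f}")
```

Under this approximation the noise shrinks with the square root of N, so halving it requires four times as many samples.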

Error Handling

API does not expose logprobs: Skip token-level and P(True) methods entirely. Answer frequency works with any black-box API that supports sampling.
NLI model disagrees on semantic equivalence: For structured answers (numbers, option letters), use exact string matching instead of NLI. Reserve NLI-based clustering for open-ended text.
Too few unique answers across samples: If all 10 samples return the same answer (frequency = 1.0), unanimity does not by itself guarantee correctness: the question may be trivially easy, or the model may be consistently wrong. Cross-check against known accuracy on similar questions before trusting high-frequency scores.
ECE looks excellent but AUROC is near 0.5: This is the ECE-accuracy coupling artifact. The model's confidences are not discriminating between correct and incorrect answers; they are simply clustered around the base accuracy rate. Do not deploy this as a reliable filter.
Verbalized confidence returns non-numeric text: Parse defensively. If the model writes "about 85%", extract 0.85. If parsing fails, exclude the sample rather than imputing a default.
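Defensive parsing of verbalized confidence, as described in the last item above, might look like the sketch below. The function name `parse_confidence` and the specific rules (exactly one number must be present, percentages are rescaled, out-of-range values are rejected) are assumptions chosen for illustration; a `None` result means the sample should be excluded rather than given a default.

```python
import re
from typing import Optional

# A decimal or integer, optionally followed by a percent sign.
_NUMBER = re.compile(r"(\d*\.\d+|\d+)\s*(%?)")


def parse_confidence(text: str) -> Optional[float]:
    """Return a confidence in [0, 1], or None if the reply cannot be parsed safely."""
    matches = _NUMBER.findall(text)
    # Require exactly one number: replies with several numbers are ambiguous.
    if len(matches) != 1:
        return None
    number, percent = matches[0]
    value = float(number)
    if percent:
        value /= 100.0
    # Reject out-of-range values instead of guessing what the model meant.
    if not 0.0 <= value <= 1.0:
        return None
    return value


for reply in ["0.85", "about 85%", "I am fairly confident", "85", "between 0.7 and 0.8"]:
    print(repr(reply), "->", parse_confidence(reply))
```

Returning `None` for ambiguous replies keeps parsing failures visible, so they can be counted and excluded rather than blended into the calibration statistics.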

Limitations

Answer frequency requires N API calls per question (typically 10x cost). For latency-sensitive or cost-constrained applications, consider batching or caching.
Semantic equivalence clustering via NLI is imperfect for long, nuanced answers where partial correctness matters. The method works best when answers have a clear right/wrong signal.
All findings are validated on scientific QA (MMLU, ARC, SciQ, GPQA, GSM8K, SVAMP, SciBench). Calibration behavior may differ on creative, subjective, or conversational tasks.
The benchmark covers models up to 70B parameters from five providers (OpenAI, Mistral, Meta, Qwen, Google). Results may not generalize to significantly larger or architecturally different models.
Reasoning models (chain-of-thought variants) show provider-dependent mitigation of polarization — there is no universal guarantee that reasoning traces improve calibration.

Reference

Müller, P., Popovič, N., Färber, M., & Steinbach, P. (2026). Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering. arXiv:2602.00279v1. https://arxiv.org/abs/2602.00279v1

Key takeaway: Answer frequency (consistency across sampled generations) is the most reliable UQ method for instruction-tuned LLMs; verbalized confidence and token-level probabilities are systematically compromised. Never evaluate calibration with ECE alone. Open-source framework: https://github.com/muelphil/llm-uncertainty-bench.