name: margin-runtime-confidence-calibration description: "Multi-Agent Runtime Grading via Incremental Normalization (MARGIN) — online confidence calibration for multi-agent AI coordination. Use when building multi-agent systems that need to weight agent trustworthiness at runtime: (1) coordinating responses from multiple foundation models, (2) selecting which agent's output to trust when self-reported confidence is unreliable, (3) calibrating confidence under distribution shift without held-out data or retraining. arxiv_id: "2605.22949" published: "2026-05-21" authors: "Joss Armstrong" tags: [multi-agent, confidence-calibration, online-learning, foundation-models, agent-coordination, trustworthiness]
MARGIN: Runtime Confidence Calibration for Multi-Agent Coordination
Core methodology from arXiv:2605.22949 (2026).
Core Concept
MARGIN (Multi Agent Runtime Grading via Incremental Normalization) is an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself. It requires no model access, no held-out data, and no retraining — making it ideal for black-box multi-agent deployments.
Key insight: Foundation model confidence is systematically miscalibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration (temperature scaling, Platt scaling) degrades under distribution shift. MARGIN fixes this by learning online from the task stream.
Algorithm
MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending:
- Confidence bands — Partition [0,1] confidence into B bands
- Per-band calibration factors — For each agent a and band b, maintain:
E_win[a][b]— exponentially weighted average of correct answersN[a][b]— effective sample count for shrinkage
- Bayesian shrinkage — Blend per-band accuracy towards global average when counts are low
- Symmetric updates — Update both the chosen agent AND non-chosen agents to avoid strategic manipulation
Three hyperparameters with robust defaults:
α(smoothing rate, default 0.05) — Controls how quickly old observations decayκ(shrinkage floor, default 5) — Minimum effective count before per-band estimate dominatesβ(bias toward prior, default 0.5) — Strength of shrinkage towards global mean
Key Findings
- Raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks
- MARGIN corrects this completely, raising pairwise resolution to 70-89%
- Surpasses the always-best-model oracle on 3 of 4 benchmarks
- Achieves 3-6x lower calibration error than best design-time baseline under distribution shift
- Validated across 19 foundation models, 8 benchmarks, 50,000+ observations
Implementation Pattern
def margin_calibrate(agent_confidence: dict[str, float],
calibration_factors: dict) -> dict[str, float]:
# Shrink per-band factor toward global mean
for agent, conf in agent_confidence.items():
band = discretize(conf, num_bands=N_BANDS)
factor = calibration_factors[agent][band]
if factor.count < KAPPA:
factor = blend_with_global(factor, global_avg, BETA)
agent_confidence[agent] *= factor
return agent_confidence
def margin_update(selected_agent: str, was_correct: bool,
all_agents: list[str],
calibration_factors: dict) -> None:
# Symmetric update: update all agents
for agent in all_agents:
if agent == selected_agent:
update_factor(agent, was_correct, ALPHA)
else:
# Non-selected agents: they would have been wrong
# if chosen (conservative bound)
update_factor(agent, not was_correct, ALPHA)
Applications
- Multi-agent routing — Select which agent to query per task
- Ensemble weighting — Weight model votes by calibrated confidence
- Confidence-aware abstention — Only act when calibrated confidence exceeds threshold
- Online adaptation — Deploy to continuously changing data streams
Activation Keywords
MARGIN, Multi-agent confidence calibration, runtime calibration, online calibration, agent coordination, foundation model trust, confidence band calibration, Bayesian shrinkage smoothing, multi-agent selection, self-reported confidence correction, ensemble weighting calibration