name: uncertainty-quantification
description: Contract for STAGE 5b UNCERTAINTY. Modeler exposes an outcome_fn in models/outcome_fn.py that takes sampled parameter values (from the effect-size priors registry) and returns decision outputs as a dict. The UQ script samples N draws from each registered parameter's prior, runs outcome_fn once per draw, and aggregates per-output 95% credible intervals into uncertainty_report.yaml. Scalar outputs (DALYs, costs, burden) get CIs; categorical outputs (per-archetype package choice, per-LGA assignment) get stability distributions. The writer must report these posterior-derived CIs as the primary uncertainty claim, NOT an ensemble ±X% perturbation. The gate blocks ACCEPT without uncertainty_report.yaml. Use when writing outcome_fn, interpreting posterior CIs, or deciding whether to use cloud compute (see cloud-compute skill). Trigger phrases include "uncertainty quantification", "UQ", "propagate priors", "outcome_fn", "posterior CI", "credible interval", "ensemble uncertainty".
Uncertainty Quantification via Prior Propagation
Why this stage exists
The malaria run's reported "±13% on DALYs" was 2.4–8× too narrow. It came from perturbing two calibrated parameters by ±20% across 3 seeds: an ensemble of computational replicates, not a posterior from literature priors. A trained modeler propagates uncertainty from the source-paper CIs that define intervention effect sizes, cost estimates, and calibration targets. That is what the UQ probe demonstrated and what this stage implements.
The modeler's contract: outcome_fn
Every run must expose a file at {run_dir}/models/outcome_fn.py
containing a deterministic callable:
def outcome_fn(params: dict) -> dict:
    """
    Run the decision-relevant portion of the model under a specific
    parameter set and return the outputs we care about for uncertainty
    analysis.

    Args:
        params: {parameter_name: sampled_value}. Keys match the `name`
            field of entries in citations.md `## Parameter Registry`.

    Returns:
        dict of {output_name: scalar_or_str}:
          - scalars → aggregated as 95% credible intervals
          - strings → aggregated as stability distributions
            (e.g., which package was optimal per archetype in this draw)

    Must be deterministic given its inputs (seed random state using a
    per-draw seed derived from the hash of params if you need stochasticity).

    Must be cheap enough that 200 invocations complete in ~15 min locally
    OR the modeler must build a cloud-parallel path (see the
    cloud-compute skill).
    """
When the full ABM is too slow: build a surrogate
If the full model takes more than 5 minutes per invocation, 200 draws will
take more than 16 hours locally (200 × 5 min ≈ 16.7 hr). Build a surrogate:
- Run the full ABM on a sparse grid (e.g., 30 parameter combinations spanning the prior supports).
- Fit a smooth emulator (Gaussian process, neural net, response surface, or just multiple linear regression on log-parameters) to predict outputs from parameters.
- Expose the emulator via outcome_fn. The UQ stage runs the emulator, not the full ABM.
- Document the emulator's RMSE against the grid points in {run_dir}/models/outcome_fn_calibration.md. A surrogate with RMSE > 10% of the mean output is suspect: either the grid is too sparse or the emulator is under-parameterized.
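The surrogate steps above can be sketched with the simplest emulator option, a linear regression on log-parameters. This is a minimal illustration assuming NumPy is available; `run_full_abm` is a toy stand-in for the expensive model, and the two parameter ranges are hypothetical prior supports, not values from any registry.

```python
import numpy as np


def run_full_abm(irs_rr: float, irs_cost: float) -> float:
    """Toy stand-in for the expensive ABM (hypothetical; swap in the real model)."""
    return 1.0e6 * irs_rr ** -0.8 * irs_cost ** -0.3


# 1. Sparse grid: 30 parameter combinations spanning assumed prior supports.
rng = np.random.default_rng(42)
grid = np.column_stack([
    rng.uniform(0.25, 0.45, size=30),  # irs_rr (assumed range)
    rng.uniform(2.0, 5.0, size=30),    # irs_cost_per_py (assumed range)
])
y = np.array([run_full_abm(a, b) for a, b in grid])

# 2. Emulator: least-squares fit of log-output on an intercept plus log-parameters.
X = np.column_stack([np.ones(len(grid)), np.log(grid)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)


def emulate(irs_rr: float, irs_cost: float) -> float:
    """Predict the ABM output from the fitted log-linear surface."""
    x = np.array([1.0, np.log(irs_rr), np.log(irs_cost)])
    return float(np.exp(x @ coef))


# 3. RMSE against the grid points, expressed as a fraction of the mean output.
pred = np.array([emulate(a, b) for a, b in grid])
rel_rmse = float(np.sqrt(np.mean((pred - y) ** 2)) / np.mean(y))
```

Here `rel_rmse` is the number to record in outcome_fn_calibration.md and compare against the 10% threshold. It is an in-sample figure, so holding out a few grid points gives a less optimistic check.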
Example outcome_fn (malaria-style allocation)
# models/outcome_fn.py
import sys, os
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from level3_interventions import (compute_archetype_ce,
                                  greedy_budget_allocate)
from level2_archetype_starsim import load_archetype_data

# Pre-load invariant data once at import.
_DATA = load_archetype_data()


def outcome_fn(params: dict) -> dict:
    """
    params is a dict sampled from the parameter registry. We override
    the default intervention effect sizes with sampled values, rerun
    the cost-effectiveness + allocation step, and return decision outputs.

    This uses a precomputed ABM response surface (see
    models/outcome_fn_calibration.md) rather than re-running the full
    Starsim ABM 200 times.
    """
    # Override the defaults with sampled values.
    effects = {
        'irs': params.get('irs_rr', 0.35),
        'smc': params.get('smc_rr', 0.27),
        'itn_pbo': params.get('itn_pbo_rr', 0.45),
        'itn_dual_ai': params.get('itn_dual_ai_rr', 0.45),
    }
    costs = {
        'irs_per_py': params.get('irs_cost_per_py', 3.20),
        'itn_pbo_per_net': params.get('itn_pbo_unit_cost', 10.0),
        'itn_dual_per_net': params.get('itn_dual_ai_unit_cost', 12.50),
    }

    archetype_ce = compute_archetype_ce(_DATA, effects, costs)
    allocation = greedy_budget_allocate(archetype_ce, budget=320e6)

    # Scalar: total DALYs averted, total budget spent
    # Categorical: per-archetype package choice (for stability analysis)
    out = {
        'dalys_averted_5yr': float(allocation['total_dalys_averted']),
        'budget_spent_usd': float(allocation['total_cost']),
        'n_lgas_allocated': int(allocation['n_lgas']),
    }
    for arch_id, pkg in allocation['per_archetype_package'].items():
        out[f'package_{arch_id}'] = pkg  # string → stability distribution
    return out
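If an outcome_fn needs stochastic steps, the contract asks for a per-draw seed derived from a hash of params. Python's built-in `hash()` is salted per process for strings and does not accept dicts, so a stable digest is the safer route. A minimal sketch, assuming parameter values are JSON-serialisable numbers; `seed_from_params` and `noisy_output` are hypothetical names, not part of the contract.

```python
import hashlib
import json
import random


def seed_from_params(params: dict) -> int:
    """Derive a reproducible 32-bit seed from one parameter draw (hypothetical helper)."""
    # sort_keys makes the digest independent of dict insertion order.
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Keep the first 4 bytes so the seed also fits 32-bit seeding APIs.
    return int.from_bytes(digest[:4], "big")


def noisy_output(params: dict) -> float:
    """Toy stochastic step: identical params always give an identical result."""
    rng = random.Random(seed_from_params(params))
    return params["irs_rr"] + rng.gauss(0.0, 0.01)
```

Using a local `random.Random` instance rather than reseeding the global generator keeps one draw from disturbing random state used elsewhere in the process.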
UQ script contract
Invoke from lead or manually:
python3 scripts/propagate_uncertainty.py {run_dir} --n-draws 200
Writes {run_dir}/uncertainty_report.yaml. Schema:
n_draws: 200
seed: 42
n_errors: 0                 # number of draws where outcome_fn raised
scalar_outputs:
  dalys_averted_5yr:
    mean: 7.41e6
    median: 7.28e6
    ci_low: 5.45e6          # 2.5th percentile
    ci_high: 9.66e6         # 97.5th percentile
    n: 200
  budget_spent_usd:
    mean: 318.2e6
    ci_low: 311.5e6
    ci_high: 320.0e6
    n: 200
categorical_outputs:
  package_A1:
    counts: {itn_pbo_80: 193, itn_dual_80: 7}
    dominant: itn_pbo_80
    dominance: 0.965        # fraction of draws choosing the dominant
    n: 200
  package_A2:
    counts: {itn_pbo_80: 154, smc_80: 46}
    dominant: itn_pbo_80
    dominance: 0.770
parameter_samples:          # what got drawn
  irs_rr:
    mean: 0.35
    ci_low: 0.27
    ci_high: 0.44
  ...
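The aggregation behind this schema can be sketched in plain Python. This is an illustration of the described logic, not the actual propagate_uncertainty.py: the prior samplers, the toy `toy_fn`, and the linear-interpolation percentile convention are all assumptions.

```python
import random
from collections import Counter
from statistics import mean, median


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already-sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def propagate(outcome_fn, priors, n_draws=200, seed=42):
    """Sample each prior independently, call outcome_fn per draw, aggregate."""
    rng = random.Random(seed)
    scalars, categoricals, n_errors = {}, {}, 0
    for _ in range(n_draws):
        params = {name: sampler(rng) for name, sampler in priors.items()}
        try:
            out = outcome_fn(params)
        except Exception:
            n_errors += 1  # failed draws are counted and skipped
            continue
        for key, val in out.items():
            target = categoricals if isinstance(val, str) else scalars
            target.setdefault(key, []).append(val)

    report = {"n_draws": n_draws, "seed": seed, "n_errors": n_errors,
              "scalar_outputs": {}, "categorical_outputs": {}}
    for key, vals in scalars.items():
        s = sorted(vals)
        report["scalar_outputs"][key] = {
            "mean": mean(s), "median": median(s),
            "ci_low": percentile(s, 2.5), "ci_high": percentile(s, 97.5),
            "n": len(s)}
    for key, vals in categoricals.items():
        counts = Counter(vals)
        dominant, top = counts.most_common(1)[0]
        report["categorical_outputs"][key] = {
            "counts": dict(counts), "dominant": dominant,
            "dominance": top / len(vals), "n": len(vals)}
    return report


# Hypothetical prior and toy outcome_fn, for illustration only.
priors = {"irs_rr": lambda rng: rng.lognormvariate(-1.05, 0.12)}
toy_fn = lambda p: {"dalys_averted_5yr": 1e6 / p["irs_rr"],
                    "package_A1": "irs" if p["irs_rr"] < 0.35 else "itn_pbo_80"}
report = propagate(toy_fn, priors)
```

Strings are routed to the stability counts and everything else to the percentile summary, mirroring the scalar/categorical split in the outcome_fn contract.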
The writer's primary-CI rule (Commit A of Phase 1.5 + this)
When reporting uncertainty on a quantitative claim:
- Primary: the CI from uncertainty_report.yaml. This is the CI that propagates source-paper uncertainty through the decision.
- Secondary (optional): ensemble-perturbation CIs from the ABM itself, clearly labeled as "computational replicate uncertainty only."
The writer MUST NOT report ensemble CIs as the primary uncertainty when
uncertainty_report.yaml exists. Decision-makers need to see source-CI
width; computational replicate width misleads them into over-confidence.
Categorical stability: what it means for policy
categorical_outputs gives you per-category stability. Interpretation:
- dominance > 0.95: robust decision. Report as a confident recommendation ("PBO-ITN dominates in this archetype in 96% of posterior draws").
- 0.70 ≤ dominance ≤ 0.95: moderately robust. Report as "likely optimal but contingent on parameter values in the published-CI range."
- dominance < 0.70: policy choice is uncertain. The report must flag this as an ICER coin-flip and either recommend additional data collection OR present both options as defensible.
Ignoring stability and reporting the point-estimate choice as "optimal" is a HIGH misrepresentation under the writer prompt rule.
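The interpretation tiers can be encoded as a small helper so the writer applies them consistently. A minimal sketch: the function name and tier labels are hypothetical, and placing the exact boundary values 0.70 and 0.95 in the middle tier is an assumption.

```python
def stability_tier(dominance: float) -> str:
    """Map a dominance fraction to a reporting tier (labels are illustrative)."""
    if not 0.0 <= dominance <= 1.0:
        raise ValueError("dominance must be a fraction in [0, 1]")
    if dominance > 0.95:
        return "robust"             # confident recommendation
    if dominance >= 0.70:
        return "moderately_robust"  # likely optimal, contingent on parameters
    return "uncertain"              # flag and present alternatives
```

With the example report above, package_A1 (0.965) lands in the robust tier and package_A2 (0.770) in the moderately robust tier.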
When to use cloud compute
See the cloud-compute skill for the decision rule. Quick version:
- outcome_fn < 2s/call: run locally (200 draws × 2s ≈ 7 min).
- outcome_fn 2–60s/call: run locally if you can wait 30 min–3 hr, otherwise use cloud.
- outcome_fn > 60s/call: MUST use cloud, OR build a surrogate.
- >1000 draws needed (e.g., for per-LGA stability in 774 units, not per-archetype): use cloud.
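The quick rule can be written down as a lookup plus a serial wall-clock estimate. This is a sketch of the thresholds listed here, not the cloud-compute skill's actual decision rule; the function names, return labels, and the choice to check the draw count first are assumptions.

```python
def local_wall_minutes(seconds_per_call: float, n_draws: int = 200) -> float:
    """Estimated serial wall-clock time in minutes for a local run."""
    return seconds_per_call * n_draws / 60.0


def compute_plan(seconds_per_call: float, n_draws: int = 200) -> str:
    """Suggest where to run the UQ draws (labels are illustrative)."""
    if n_draws > 1000:
        return "cloud"               # e.g. per-LGA stability runs
    if seconds_per_call < 2:
        return "local"
    if seconds_per_call <= 60:
        return "local_or_cloud"      # depends on how long you can wait
    return "cloud_or_surrogate"
```

For example, `local_wall_minutes(60)` gives 200 minutes, which is why the upper end of the middle band is roughly a three-hour wait.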
Gate behavior
The STAGE 7 validator blocks ACCEPT if:
- uncertainty_report.yaml is missing AND at least one parameter has CIs in the registry (i.e., UQ is applicable).
- uncertainty_report.yaml has n_errors > n_draws / 4 (outcome_fn raised on >25% of draws; the surrogate is broken or misspecified).
MEDIUM blockers fire if:
- n_draws < 100 (insufficient for stable 2.5–97.5 percentiles).
- Any scalar output's CI width > 3× its mean (suggests the priors dominate the data; the modeler may need to tighten priors to data-consistent posteriors via MCMC, a Phase 3 job).
- Any categorical output has dominance < 0.50 (coin-flip; writer must present as a genuine toss-up).
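The gate conditions above can be checked against a parsed report with a few comparisons. A minimal sketch, assuming the report has already been loaded into a dict (or is `None` when the file is missing); `gate_blockers`, the severity labels, and the use of the absolute mean are assumptions, not the STAGE 7 validator's code.

```python
def gate_blockers(report, uq_applicable=True):
    """Return (severity, message) tuples for a parsed uncertainty report.

    `report` is None when uncertainty_report.yaml is missing.
    """
    if report is None:
        return [("BLOCK", "uncertainty_report.yaml missing")] if uq_applicable else []
    blockers = []
    n_draws = report["n_draws"]
    if report.get("n_errors", 0) > n_draws / 4:
        blockers.append(("BLOCK", "outcome_fn raised on >25% of draws"))
    if n_draws < 100:
        blockers.append(("MEDIUM", "n_draws < 100"))
    for name, s in report.get("scalar_outputs", {}).items():
        # Compare CI width with three times the (absolute) mean.
        if "mean" in s and (s["ci_high"] - s["ci_low"]) > 3 * abs(s["mean"]):
            blockers.append(("MEDIUM", f"{name}: CI width > 3x mean"))
    for name, c in report.get("categorical_outputs", {}).items():
        if c["dominance"] < 0.50:
            blockers.append(("MEDIUM", f"{name}: dominance < 0.50"))
    return blockers
```

An empty list means the UQ-related checks pass; any `BLOCK` entry corresponds to a condition that blocks ACCEPT.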
Round-2+ behavior
On re-runs, invoke with the same --seed for reproducibility. Changes in
uncertainty_report.yaml between rounds reflect either (a) modeler
updates to outcome_fn or (b) registry parameter changes. Both are
audit-visible in git history.
Minimal policy on correlated priors
The current implementation samples each registered parameter independently. In reality some parameters are correlated (e.g., IRS and ITN efficacy share pyrethroid susceptibility mechanics). If you know a correlation exists and matters:
- Document it in citations.md `## Parameter Registry` as a correlations: block (schema TBD; for now just narrative notes in the notes: field).
- Implement the correlation inside outcome_fn by transforming independently-sampled params into correlated ones at the top.
A future enhancement to propagate_uncertainty.py will support declared
correlations at the registry level. Until then, correlated priors are
the modeler's responsibility inside outcome_fn.
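One way to transform independent draws into correlated ones inside outcome_fn is to mix their z-scores on the log scale. This sketch assumes both priors are lognormal with known log-scale mean and standard deviation; the `(mu, sigma)` pairs and `rho=0.5` below are hypothetical illustrations, not registry values, and `apply_correlations` is a hypothetical helper name.

```python
import math


def correlate_lognormal(x_driver, x_target, driver_prior, target_prior, rho):
    """Return x_target adjusted so its log z-score correlates with x_driver's.

    Each *_prior is (mu, sigma) of the lognormal on the log scale
    (assumed known to the modeler; not read from the registry here).
    """
    z_driver = (math.log(x_driver) - driver_prior[0]) / driver_prior[1]
    z_target = (math.log(x_target) - target_prior[0]) / target_prior[1]
    # For independent standard normals, this mix has correlation rho with
    # z_driver and stays standard normal, so the target's marginal is unchanged.
    z_new = rho * z_driver + math.sqrt(1.0 - rho ** 2) * z_target
    return math.exp(target_prior[0] + target_prior[1] * z_new)


def apply_correlations(params):
    """Call at the top of outcome_fn; priors and rho here are illustrative."""
    out = dict(params)  # leave the sampled dict untouched
    out["itn_pbo_rr"] = correlate_lognormal(
        params["irs_rr"], params["itn_pbo_rr"],
        driver_prior=(math.log(0.35), 0.12),
        target_prior=(math.log(0.45), 0.10),
        rho=0.5,
    )
    return out
```

Inside outcome_fn the first line would then be `params = apply_correlations(params)`. The approach does not carry over directly to priors that are not lognormal; those would need a transform through each prior's own CDF.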