policyengine-simulation-mechanics

name: policyengine-simulation-mechanics description: | ALWAYS LOAD THIS SKILL before writing any policyengine.py microsimulation code. Contains correct import paths, environment setup, dataset loading, and analysis patterns. Triggers: "write a script", "policyengine.py", "microsimulation script", "run a simulation", "load the dataset", "FRS", "EFRS", "enhanced FRS", "CPS", "enhanced CPS", "by income decile", "by tenure", "by region", "energy spending", "domestic energy", "household net income", "output_dataset", "ensure_datasets", "uk_datasets", "us_datasets", "import datasets", "from policyengine", "Simulation(dataset=", "uk_latest", "us_latest", "plotly", "analysis script", "decile breakdown", "percentile", "groupby", "weighted", "mean", "median", "p25", "p75", "tenure type", "income band", "policy reform script".

PolicyEngine Simulation Mechanics

This skill covers advanced patterns for working with policyengine.py simulations, including caching, result access, and entity mapping.

CRITICAL: Environment Setup

Before writing any code, check the environment. The policyengine.py package must be installed in the project's .venv.

# Always run from the policyengine.py repo root:
cd /path/to/policyengine.py
uv run python script.py

# Or activate first:
source .venv/bin/activate
python script.py

# NEVER use bare `pip install` — always:
uv pip install -e ".[uk]"   # for UK work
uv pip install -e ".[us]"   # for US work

If from policyengine.core import Simulation fails:

cd /path/to/policyengine.py
uv pip install -e ".[uk]"
# Then re-run with: uv run python script.py

CRITICAL: Correct Import Paths

Only these imports exist — do not guess others:

# Core simulation
from policyengine.core import Simulation

# UK model
from policyengine.tax_benefit_models.uk import (
    uk_latest,          # The model version (pass as tax_benefit_model_version=)
    uk_model,           # The model itself
    PolicyEngineUKDataset,
    UKYearData,
    create_datasets,    # Create & cache datasets from HF source
    load_datasets,      # Load cached datasets from disk
    ensure_datasets,    # Create if missing, load if present (recommended)
)

# US model
from policyengine.tax_benefit_models.us import (
    us_latest,
    PolicyEngineUSDataset,
    ensure_datasets,
)

# Outputs
from policyengine.outputs.aggregate import Aggregate, AggregateType
from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType

# Plotting
from policyengine.utils.plotting import COLORS, format_fig

There is NO:

policyengine.core.dataset_registry
policyengine.datasets
policyengine.core.dataset_version.DatasetVersion.list()

UK Datasets

Loading UK datasets

Use ensure_datasets() — it returns a dict[str, PolicyEngineUKDataset], building files in ./data/ on first run and loading from disk on subsequent runs.

WARNING: from policyengine.tax_benefit_models.uk import datasets gives you the Python submodule, not a dict. Never index it like a dict.

from policyengine.tax_benefit_models.uk import ensure_datasets

uk = ensure_datasets(
    datasets=[
        "hf://policyengine/policyengine-uk-data/frs_2023_24.h5",
        "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5",
    ],
    years=[2026],
    data_folder="./data",
)

efrs = uk["enhanced_frs_2023_24_2026"]
frs  = uk["frs_2023_24_2026"]

Dict key format: "{stem}_{year}" e.g. "enhanced_frs_2023_24_2026"

To force regeneration: delete ./data/ and call ensure_datasets() again.

Loading US datasets

from policyengine.tax_benefit_models.us import ensure_datasets

us = ensure_datasets(
    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
    years=[2026],
    data_folder="./data",
)

ecps = us["enhanced_cps_2024_2026"]

Default US dataset: enhanced_cps_2024.h5 (Enhanced CPS), years 2024–2028.

Inspecting available variables

Always inspect the dataset to find available variable names — never guess:

from policyengine.tax_benefit_models.uk import ensure_datasets

uk = ensure_datasets(years=[2026], data_folder="./data")
d = uk["enhanced_frs_2023_24_2026"]

# Input variables (present in raw data)
print("household:", list(d.data.household.columns))
print("person:   ", list(d.data.person.columns))
print("benunit:  ", list(d.data.benunit.columns))

Input variables are what's in the raw survey data — demographics, reported incomes, consumption, wealth, flags.

Computed variables (household_net_income, income_tax, universal_credit, etc.) are not in the raw dataset — they are calculated by the simulation. To see what's available after running:

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest

sim = Simulation(dataset=d, tax_benefit_model_version=uk_latest)
sim.run()

print("household (post-sim):", list(sim.output_dataset.data.household.columns))
print("person (post-sim):   ", list(sim.output_dataset.data.person.columns))

The computed variables available are defined by uk_latest.entity_variables — inspect this to see the full list without running a simulation:

from policyengine.tax_benefit_models.uk import uk_latest
print(uk_latest.entity_variables)  # dict: entity → [variable names]

For Analysts: Core Concepts

When running simulations with policyengine.py (the microsimulation package, not the API client), you work with three key components:

Simulation.ensure() - Smart caching to avoid redundant computation
simulation.output_dataset.data - Accessing calculated results
map_to_entity() - Converting data between entity levels (person ↔ household)

Note: This is for microsimulation with policyengine.py, not the policyengine Python API client (which uses Simulation(situation=...)).

Simulation Lifecycle

The Four Methods

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest

simulation = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
)

# Method 1: Always run (no caching)
simulation.run()

# Method 2: Run only if needed (recommended)
simulation.ensure()

# Method 3: Save results to disk
simulation.save()

# Method 4: Load results from disk
simulation.load()

When to Use Each

run(): Use when you need fresh results or parameters changed
ensure(): Use for iterative development (checks cache → disk → run)
save(): Use to persist large simulation results
load(): Use to resume from previous session

How ensure() Works

def ensure(self):
    # 1. Check in-memory LRU cache (100 simulations)
    cached = _cache.get(self.id)
    if cached:
        self.output_dataset = cached.output_dataset
        return

    # 2. Try loading from disk
    try:
        self.tax_benefit_model_version.load(self)
    except Exception:
        # 3. Only run if both cache and disk fail
        self.run()
        self.save()

    # 4. Add to cache for next ensure() call
    _cache.add(self.id, self)

Performance impact:

First call: Full simulation runtime (seconds to minutes)
Same session: Instant (in-memory cache)
New session: Fast (disk load, no recomputation)

Example: Reusing Baseline Across Reforms

# Run baseline once
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()  # First call: runs simulation
baseline.save()    # Persist to disk

# Test multiple reforms
for reform in [reform1, reform2, reform3]:
    baseline.ensure()  # Instant from cache!

    reform_sim = Simulation(
        dataset=dataset,
        tax_benefit_model_version=uk_latest,
        policy=reform
    )
    reform_sim.run()  # Only reform needs to run

    # Compare results...

Accessing Results: output_dataset.data

After running a simulation, all calculated variables are in simulation.output_dataset.data.

Structure (UK Example)

simulation.run()

# Access output container
output = simulation.output_dataset.data

# Entity-level MicroDataFrames
output.person      # Person-level results
output.benunit     # Benefit unit results
output.household   # Household-level results

US Entity Structure

# US has more entities
output.person
output.tax_unit       # Federal tax filing unit
output.spm_unit       # Supplemental Poverty Measure unit
output.family         # Census family definition
output.marital_unit   # Married couple or single
output.household

Available Variables

Each dataframe contains input variables + calculated variables:

# Person-level (UK)
print(output.person.columns)
# ['person_id', 'person_household_id', 'age', 'employment_income',
#  'income_tax', 'national_insurance', 'net_income', ...]

# Household-level (UK)
print(output.household.columns)
# ['household_id', 'region', 'rent', 'household_net_income',
#  'household_benefits', 'household_tax', ...]

# Benunit-level (UK)
print(output.benunit.columns)
# ['benunit_id', 'universal_credit', 'child_benefit',
#  'working_tax_credit', 'child_tax_credit', ...]

Direct Data Access

# Get specific columns
incomes = output.household[["household_id", "household_net_income"]]

# Filter data
high_earners = output.person[output.person["employment_income"] > 100000]

# Calculate statistics (automatically weighted!)
mean_income = output.household["household_net_income"].mean()
total_tax = output.household["household_tax"].sum()

# Access individual values
first_hh_income = output.household["household_net_income"].iloc[0]

MicroDataFrame Automatic Weighting

All operations respect survey weights automatically:

# These are all weighted calculations
total_population = output.person["person_weight"].sum()
mean_income = output.household["household_net_income"].mean()
poverty_rate = output.household["in_absolute_poverty_bhc"].mean()

# Groupby operations are weighted
by_region = output.household.groupby("region")["household_net_income"].mean()

Entity Mapping with map_to_entity()

Convert data between entity levels (e.g., sum person income to household, or broadcast household rent to persons).

Method Signature

output.map_to_entity(
    source_entity: str,        # Entity to map from
    target_entity: str,        # Entity to map to
    columns: list[str] = None, # Columns to map (None = all)
    values: np.ndarray = None, # Custom values instead
    how: str = "sum"           # Aggregation method
)

Aggregation Methods

Person → Group (aggregation):

how="sum" (default): Sum values within each group
how="first": Take first value in each group
how="mean": Average values
how="max": Maximum value
how="min": Minimum value

Group → Person (expansion):

how="project" (default): Broadcast group value to all members
how="divide": Split group value equally among members

Example 1: Sum Person Income to Household

# Sum employment income across all people in each household
household_employment = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum"
)

# Result is MicroDataFrame at household level
print(household_employment.columns)
# ['household_id', 'employment_income']  # Now household total

Example 2: Broadcast Household Rent to Persons

# Give each person their household's rent value
person_rent = output.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["rent"],
    how="project"
)

# Each person now has their household's rent
print(person_rent.columns)
# ['person_id', 'rent']

Example 3: Divide Household Value Per Person

# Split household savings equally among members
person_savings_share = output.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["total_savings"],
    how="divide"
)

# If household has £12,000 savings and 3 people, each gets £4,000

Example 4: Map Custom Values

import numpy as np

# Calculate custom person-level values
custom_tax = np.where(
    output.person["employment_income"] > 50000,
    output.person["income_tax"] * 1.1,  # 10% increase for high earners
    output.person["income_tax"]
)

# Aggregate to household level
household_custom_tax = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=custom_tax,
    how="sum"
)

Example 5: Multi-Column Mapping

# Map multiple income sources to household level
household_incomes = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=[
        "employment_income",
        "self_employment_income",
        "pension_income",
        "savings_interest_income"
    ],
    how="sum"
)

# Result has all columns at household level

Example 6: Cross-Entity Mapping (Group to Group)

# UK: Map benunit benefits to household level
# (Multiple benunits can exist in one household)
household_uc = output.map_to_entity(
    source_entity="benunit",
    target_entity="household",
    columns=["universal_credit", "child_benefit"],
    how="sum"
)

Automatic Mapping in Aggregate Classes

The Aggregate and ChangeAggregate classes automatically handle entity mapping when the variable and target entity don't match:

from policyengine.outputs.aggregate import Aggregate, AggregateType

# income_tax is person-level, but we want household-level sum
total_tax = Aggregate(
    simulation=simulation,
    variable="income_tax",  # Person-level
    entity="household",      # Household-level aggregation
    aggregate_type=AggregateType.SUM,
)
total_tax.run()
# Automatically maps income_tax from person to household using sum()

Common Patterns

Pattern 1: Compare Baseline vs Reform

# Run both simulations
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()

reform = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
    policy=reform_policy
)
reform.ensure()

# Get outputs
baseline_out = baseline.output_dataset.data
reform_out = reform.output_dataset.data

# Calculate differences
baseline_income = baseline_out.household["household_net_income"]
reform_income = reform_out.household["household_net_income"]

difference = reform_income - baseline_income

# Count winners/losers (weighted)
winners = (difference > 0).sum()
losers = (difference < 0).sum()
unchanged = (difference == 0).sum()

Pattern 2: Calculate Custom Derived Variable

# Calculate marginal tax rate at person level
person_data = output.person.copy()
person_data["mtr"] = (
    (person_data["income_tax"] + person_data["national_insurance"])
    / person_data["employment_income"].clip(lower=1)
) * 100

# Map to household level (max MTR in household)
household_mtr = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=person_data["mtr"].values,
    how="max"
)

Pattern 3: Extract Subset for Analysis

# Get London households with children
london_hh = output.household[output.household["region"] == "LONDON"]

households_with_children = output.person.groupby("person_household_id")["age"].apply(
    lambda ages: (ages < 18).any()
)

# Combine filters
london_ids = set(london_hh["household_id"])
hh_with_kids_ids = set(households_with_children[households_with_children].index)
target_ids = london_ids & hh_with_kids_ids

# Extract subset
subset_hh = output.household[output.household["household_id"].isin(target_ids)]
subset_persons = output.person[output.person["person_household_id"].isin(target_ids)]

Pattern 4: Reuse Baseline Across Multiple Reforms

# Run baseline once
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
baseline.save()

# Test multiple reforms efficiently
reforms = [reform1, reform2, reform3]
results = {}

for reform in reforms:
    baseline.ensure()  # Instant from cache

    reform_sim = Simulation(
        dataset=dataset,
        tax_benefit_model_version=uk_latest,
        policy=reform
    )
    reform_sim.run()

    # Calculate impact
    from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType

    revenue = ChangeAggregate(
        baseline_simulation=baseline,
        reform_simulation=reform_sim,
        variable="household_tax",
        aggregate_type=ChangeAggregateType.SUM,
    )
    revenue.run()

    results[reform.name] = revenue.result

Direct Data Analysis (without Aggregate)

For custom analyses (decile breakdowns, percentiles, groupby), work directly with output_dataset.data after running the simulation. This is often simpler than using Aggregate.

Full working example: energy spending by income decile and tenure type

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset, uk_latest

# Load dataset
dataset = PolicyEngineUKDataset(
    name="Enhanced FRS 2026",
    description="EFRS 2026",
    filepath="./data/enhanced_frs_2023_24_year_2026.h5",
    year=2026,
)
dataset.load()

# Run simulation to compute income variables
simulation = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
simulation.ensure()  # caches to disk after first run

# Access results as DataFrames
hh = simulation.output_dataset.data.household

# Assign income decile (weighted quantile)
hh["income_decile"] = pd.qcut(
    hh["household_net_income"],
    q=10,
    labels=[f"D{i}" for i in range(1, 11)],
)

# Group and calculate stats
stats = (
    hh.groupby(["income_decile", "tenure_type"])["domestic_energy_consumption"]
    .agg(
        mean="mean",
        p25=lambda x: np.percentile(x, 25),
        p75=lambda x: np.percentile(x, 75),
    )
    .reset_index()
)

Key points:

simulation.output_dataset.data.household is a MicroDataFrame with weights
domestic_energy_consumption is household-level (annual £)
tenure_type values: OWNED_OUTRIGHT, OWNED_WITH_MORTGAGE, RENT_FROM_COUNCIL, RENT_PRIVATELY, RENT_FROM_HA
Income deciles must be computed from simulation output (not raw data)

Performance Tips

Use ensure() for iterative work: Can save minutes when re-running analyses
Filter before mapping: Reduces computation on large datasets
Use Aggregate classes: Optimised implementations for common operations
Batch similar calculations: Run multiple aggregates in sequence
Cache intermediate results: Store derived calculations

# Good: Filter then map
high_earners = output.person[output.person["employment_income"] > 100000]
high_earner_hh_income = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=high_earners["employment_income"].values,
    how="sum"
)

# Less efficient: Map then filter
all_hh_income = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum"
)
high_earner_hh = all_hh_income[all_hh_income["employment_income"] > 100000]

For Contributors: Implementation

Current implementation:

# Simulation lifecycle
cat policyengine.py/src/policyengine/core/simulation.py

# Entity mapping logic
cat policyengine.py/src/policyengine/core/dataset.py

# Cache implementation
cat policyengine.py/src/policyengine/core/cache.py

Key patterns:

Simulation caching: LRU cache with max 100 entries, keyed by UUID
Entity mapping: Automatic detection of mapping direction (person→group or group→person)
MicroDataFrame: All entity data uses weighted DataFrames from microdf package

Related skills:

policyengine-core-skill - Understanding simulation engine architecture
microdf-skill - Working with weighted DataFrames
policyengine-python-client-skill - Basic simulation usage

Debugging Tips

Verify Simulation Ran

assert simulation.output_dataset is not None, "Simulation hasn't run"

# Check for expected variables
expected = ["household_net_income", "household_tax"]
actual = simulation.output_dataset.data.household.columns
assert all(v in actual for v in expected), "Missing variables"

Check Entity Linkages

# Verify person-household mapping is valid
person_hh_ids = set(output.person["person_household_id"])
household_ids = set(output.household["household_id"])
assert person_hh_ids.issubset(household_ids), "Invalid linkage"

Verify Weights

# Check weights sum correctly
total_persons = output.person["person_weight"].sum()
print(f"Weighted population: {total_persons:,.0f}")

# Check for missing weights
assert not output.person["person_weight"].isna().any(), "Missing weights"