bisect-perf-regression - SKILL.md Agent Skill

name: bisect-perf-regression
description: Systematically bisect an end-to-end performance regression between two refs (tags/branches/commits) down to the exact offending commit(s). Use this when CI/customer benchmarks show one version is measurably slower than another and you need to attribute the cost to specific PRs.

Bisect a performance regression

Pin a perf regression to specific commits using drift-cancelled, repeated, cold benchmarks.

You will need (or need to build) a small harness around the workload of interest. The exact shape of the harness does not matter, as long as it satisfies the properties described in step 1 below (interleaved labels, cold per-iteration state, median + CV reporting).

Arguments

$ARGUMENTS should identify:

A good ref (older / fast) and a bad ref (newer / slow)
A reproducible workload and its parameters (workload size, dataset, mode), expressed as whatever the harness consumes (env vars, CLI flags, config file)
An expected gap (so you can stop early when a real step is found)

Instructions

1. Reproduce the gap stably before bisecting

Before bisecting anything, prove the two endpoints are reliably distinguishable.

Use an interleaved harness that restarts the server (or the system under test) per iteration so any cached state is cold each run, drives the workload round-robin across labels (so environmental drift attaches equally to every label), and reports median + CV per label. Sequential "all N runs of label A, then all N runs of label B" attributes any host drift across the wall-clock window entirely to the second label, and is the single largest source of false positives in perf bisection.
Pin CPU count (e.g. --cpus=8 for a Docker harness) and run on a quiet Linux host. Shared/laptop Docker setups (macOS Docker Desktop, Codespaces, noisy CI runners) are generally too jittery for sub-3% deltas.
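A minimal sketch of the interleaving and reporting logic described above. The `run_once` callback is an assumption: it stands in for whatever restarts the system under test, drives the workload for one label, and returns the elapsed seconds. The function name and the report layout are illustrative, not part of any existing harness.

```python
import statistics
from typing import Callable, Dict, List


def interleaved_bench(
    labels: List[str],
    run_once: Callable[[str], float],
    n_runs: int = 3,
) -> Dict[str, Dict[str, float]]:
    """Time every label once per round so host drift is shared by all labels."""
    samples: Dict[str, List[float]] = {label: [] for label in labels}
    for _ in range(n_runs):
        # Round-robin order: A, B, C, A, B, C, ... never A, A, A, B, B, B.
        for label in labels:
            # run_once (hypothetical) must start from cold state each call
            # and return the elapsed time in seconds for that single run.
            samples[label].append(run_once(label))

    report: Dict[str, Dict[str, float]] = {}
    for label, times in samples.items():
        median = statistics.median(times)
        # CV = sample stdev / mean, in percent. stdev needs n_runs >= 2.
        cv_pct = 100.0 * statistics.stdev(times) / statistics.mean(times)
        report[label] = {"median": median, "cv_pct": cv_pct}
    return report
```

Because every round visits each label once, a slow minute on the host inflates one sample of every label rather than all samples of the last label.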

Pick N_RUNS adaptively from observed CV — don't burn 5–8 runs by default.

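Start at N_RUNS=3. Two runs already yield a stdev, but three is the smallest count at which the median can absorb one outlier and the CV rests on more than a single pair of samples.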

After the run, inspect the per-label CV printed by the harness:

| Observed max CV across labels | Action |
| --- | --- |
| ≤ 1.5 % | 3 runs is enough, accept the medians |
| 1.5 – 2.5 % | rerun with `N_RUNS=5` for the labels involved in the suspect step |
| 2.5 – 4 % | rerun with `N_RUNS=8`, and only trust deltas ≥ 2 × max CV |
| > 4 % | the host is too noisy; fix it before bisecting (see below) |

A claim is "real" when the median gap exceeds max(3 %, 2 × max CV across the two labels).

If CV stays > 4 % even at N_RUNS=8, increase the per-run workload size, or fix the host (other processes, CPU governor, thermal throttling, neighbour VMs) before continuing — bisection on noisy data attributes regressions to the wrong commit.
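The escalation table and the "real" threshold can be written down as two small functions. This is a sketch under stated assumptions: the function names are hypothetical, boundary values (exactly 1.5 %, 2.5 %, 4 %) are assigned to the lower band, and the gap is measured relative to the first (baseline) median.

```python
def next_action(max_cv_pct: float) -> str:
    """Map the observed max CV (percent) to the action from the table."""
    if max_cv_pct <= 1.5:
        return "accept medians at N_RUNS=3"
    if max_cv_pct <= 2.5:
        return "rerun suspect labels with N_RUNS=5"
    if max_cv_pct <= 4.0:
        return "rerun with N_RUNS=8; trust only deltas >= 2 x max CV"
    return "host too noisy; fix it before bisecting"


def is_real(
    baseline_median: float,
    candidate_median: float,
    baseline_cv_pct: float,
    candidate_cv_pct: float,
) -> bool:
    """True when the median gap exceeds max(3 %, 2 x max CV of the pair)."""
    gap_pct = 100.0 * abs(candidate_median - baseline_median) / baseline_median
    threshold_pct = max(3.0, 2.0 * max(baseline_cv_pct, candidate_cv_pct))
    return gap_pct > threshold_pct
```

For example, a 5 % gap counts as real when both CVs are near 1 %, but not when one label shows a 3 % CV, because the threshold then rises to 6 %.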

2. Bracket with version tags first, then commits

Always start with broad strokes and narrow down — never start at the commit level.

Tag ladder — list the released tags between good and bad and run them all in one interleaved sweep so they share drift. For example, with a project that has tags v2.10.12 … v2.10.30, run all of them as labels in a single interleaved sweep at N_RUNS=3. Look for the step — the adjacent pair with the largest jump. That's your new [lo, hi] interval.
First-parent commit bisect within [lo, hi]. Enumerate evenly-spaced first-parent commits that actually touched code (filter to src/*, deps/*, or the relevant subtrees; skip pure CI and docs commits, which do not change the code being measured, and skip vendoring commits only when they leave the compiled code unchanged). Build them, run the interleaved bench [lo, p1, p2, …, hi], and inspect adjacent deltas. Start at N_RUNS=3 and escalate per the CV table in step 1.
Iterate. Pick the adjacent pair with the largest delta and recurse on that narrower interval. Stop when [lo, hi] is one commit, or when the largest remaining adjacent delta is below the noise floor.
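The probe selection and the "largest adjacent jump" rule can be sketched as follows. Both helper names are hypothetical. The commit list is assumed to be ordered oldest first, for instance the output of `git rev-list --first-parent --reverse <lo>..<hi> -- src deps` with `<lo>` prepended, since that range excludes `<lo>` itself.

```python
from typing import List, Sequence, Tuple


def evenly_spaced(commits: Sequence[str], k: int) -> List[str]:
    """Pick up to k probes from commits, always keeping both endpoints."""
    if k < 2:
        raise ValueError("need at least the two endpoints")
    if len(commits) <= k:
        return list(commits)
    last = len(commits) - 1
    # A set removes duplicate indices that rounding can produce.
    indices = sorted({round(i * last / (k - 1)) for i in range(k)})
    return [commits[i] for i in indices]


def largest_step(ladder: Sequence[Tuple[str, float]]) -> Tuple[str, str, float]:
    """Return (lo, hi, delta_pct) for the adjacent pair with the biggest jump.

    ladder holds (label, median) pairs ordered oldest first.
    """
    best = None
    for (lo, lo_median), (hi, hi_median) in zip(ladder, ladder[1:]):
        delta_pct = 100.0 * (hi_median - lo_median) / lo_median
        if best is None or delta_pct > best[2]:
            best = (lo, hi, delta_pct)
    if best is None:
        raise ValueError("ladder needs at least two entries")
    return best
```

The pair returned by `largest_step` becomes the next [lo, hi]; whether its delta is trustworthy is still decided by the CV rules from step 1.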

3. Watch for compounding / multi-step regressions

A single endpoint number like "+30% slower" can hide several smaller steps that may compound through the same code path. Always print the full per-tag / per-commit ladder, not just endpoint deltas, because a smooth slope and a single cliff call for very different fixes.
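To see how small steps add up to a large endpoint number, here is a short sketch assuming the steps multiply along one code path. The step sizes are made-up illustrative values, not measurements, and `compound_pct` is a hypothetical helper.

```python
import math
from typing import Iterable


def compound_pct(step_pcts: Iterable[float]) -> float:
    """Total slowdown (%) when per-step slowdowns multiply along one path."""
    factor = math.prod(1.0 + pct / 100.0 for pct in step_pcts)
    return 100.0 * (factor - 1.0)


# Three illustrative steps of 10 %, 8 %, and 9 %:
# 1.10 * 1.08 * 1.09 = 1.29492, i.e. about +29.5 % end to end.
total = compound_pct([10.0, 8.0, 9.0])
```

An endpoint reading of roughly +30 % is therefore consistent with three separate regressions, each of which would need its own fix.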

3a. Isolating residuals with the "fix-on-commit" technique

When a known regression dominates the gap and a fix is already in flight, apply that fix on top of every historical commit you benchmark. This neutralises the known cost so the ladder surfaces only the residual regressions that arrived alongside it. Without this technique, a large dominant cost can mask smaller but still real per-commit jumps.

Build the fix-on-commit variants by:

Creating a fix-on-<sha> branch off <sha>.
Applying the in-flight fix on top.
Building and benching that branch the same way you would a stock commit.

Use a code injector (a small script that locates the target function by signature regex, balances braces to find its body, and inserts the fix at a known anchor) rather than a static git apply patch when:

the target function's signature changes inside the bracketed range, or
surrounding code shifts line numbers across tags (3-line context is too small to disambiguate, and --ignore-whitespace won't save you).

Keep the injector idempotent (detect its own marker before injecting) so reruns are safe across rebuilds.
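A minimal sketch of such an injector, assuming a brace-delimited language. The marker text, function name, and parameters are hypothetical. The brace counting is deliberately naive: it does not skip braces inside string literals or comments, so a real injector may need a more careful scanner.

```python
import re

# Hypothetical marker; any string that never occurs in stock sources works.
MARKER = "/* fix-on-commit: injected */"


def inject_fix(source: str, signature_re: str, anchor: str, fix: str) -> str:
    """Insert fix after anchor inside the first function matching signature_re."""
    if MARKER in source:
        return source  # already injected, so a rerun is a no-op

    match = re.search(signature_re, source)
    if match is None:
        raise ValueError("target function signature not found")

    open_brace = source.find("{", match.start())
    if open_brace == -1:
        raise ValueError("function body not found")

    # Balance braces to find where the function body ends.
    depth = 0
    close_brace = -1
    for i in range(open_brace, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                close_brace = i
                break
    if close_brace == -1:
        raise ValueError("unbalanced braces in function body")

    # Only accept the anchor when it lies inside the target body.
    at = source.find(anchor, open_brace, close_brace)
    if at == -1:
        raise ValueError("anchor not found inside function body")
    insert_at = at + len(anchor)
    return source[:insert_at] + "\n" + MARKER + "\n" + fix + source[insert_at:]
```

Raising on a missing signature or anchor, instead of silently returning the input, keeps an unpatched build from being benchmarked as if it carried the fix.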

4. Confirm the offender

Once a single commit is suspected:

Read the diff. The change must plausibly cause the observed cost (e.g. new syscalls per token, new allocations per doc, new Rust↔C boundary calls).
Run a confirmation pair: [<commit>^, <commit>] only. Here it's worth spending more iterations than the bisect ladder used — start at N_RUNS=5 and escalate to 8 if max CV > 2.5 %. The delta should match the step size found during bisection to within 2 × max CV.
If you can, build a revert candidate (<bad> with just that commit reverted) and confirm it returns to <lo> performance. This guards against a noisy run implicating an innocent commit that sits next to the real offender.
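The agreement check for the confirmation pair can be sketched as below. It assumes "within 2 × max CV" means the pair delta and the bisection step may differ by at most that many percentage points; the function name is hypothetical.

```python
def confirms_step(
    parent_median: float,
    commit_median: float,
    bisect_step_pct: float,
    max_cv_pct: float,
) -> bool:
    """True when the confirmation-pair delta matches the bisection step.

    The two figures may differ by at most 2 x max CV, in percentage points.
    """
    pair_delta_pct = 100.0 * (commit_median - parent_median) / parent_median
    tolerance_pct = 2.0 * max_cv_pct
    return abs(pair_delta_pct - bisect_step_pct) <= tolerance_pct
```

If the pair shows a much smaller delta than the ladder did, the ladder step was probably noise or belongs to a neighbouring commit, and the interval should be rerun at a higher N_RUNS.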

5. Honest reporting

Report only what the data supports:

Endpoint medians + CV
The bisection ladder with adjacent deltas
The candidate commit(s), with PR link, author, and a one-line cause hypothesis grounded in the diff
Residual unexplained delta, if any, with the noise floor

Do not round, do not collapse multi-step regressions into a single cause, and do not skip the confirmation step.

Common pitfalls

Cold-cache leakage into the first iteration: the first build in a session is usually slow due to Docker layer pulls or compiler-cache warm-up. Pre-build all artifacts before starting the timed benchmark.
Wrong runtime/base image for the branch under test — each build must run against the runtime version it was compiled against (different minor branches often pin different runtime/host versions). Verify the harness picks the correct base image / runtime per label.
Auto-loaded bundled component — some base images auto-load a bundled version of the component under test via the entrypoint. Bypass that (e.g. override the entrypoint) so the component you actually want to measure is the only one loaded.
Toolchain / dependency drift across the bisect range — historical tags may need patches (compiler / library API breakage on newer hosts) or a specific toolchain pin (e.g. a particular nightly). Wrap the per-commit build in a script that detects the tag and applies the required patches / toolchain pin on demand.
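One way to structure the per-commit build wrapper from the last pitfall is a lookup from tag to build plan. Everything specific here is a placeholder: the version bounds, the toolchain name, the patch path, and the `vMAJOR.MINOR.PATCH` tag format are assumptions for illustration, and applying the plan is left to the surrounding build script.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

Version = Tuple[int, ...]


class BuildPlan(NamedTuple):
    toolchain: Optional[str]  # None means "use the host default"
    patches: List[str]


# Hypothetical rules: (first_version, last_version, plan), bounds inclusive.
RULES = [
    (
        (2, 10, 12),
        (2, 10, 19),
        BuildPlan("example-pinned-toolchain", ["patches/example-api-fix.patch"]),
    ),
]


def parse_tag(tag: str) -> Version:
    """Turn 'v2.10.12' into (2, 10, 12) so versions compare numerically."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def plan_for(tag: str) -> BuildPlan:
    """Return the toolchain pin and patches required to build this tag."""
    version = parse_tag(tag)
    for first, last, plan in RULES:
        if first <= version <= last:
            return plan
    return BuildPlan(None, [])
```

Comparing parsed tuples rather than raw strings matters because "v2.10.9" sorts after "v2.10.12" as text but before it as a version.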

Report Back

End with:

The good→bad ladder (medians, CV, adjacent %)
The implicated commit(s), short SHA + PR + title
The confirmation-pair numbers
A diff-grounded hypothesis for why the commit costs what it does
Residual gap, if any, and what would be needed to attribute it