name: scr-benchmark-optimizer version: 3.0.0 description: > Self-improvement loop for skill-code-review: benchmark it against competing AI code reviewers on the open Martian offline-50 set, then drive its bug-finding quality UP (recall and F1) and its review latency DOWN through a strictly gated experiment loop. Reviews run through the PRODUCT runner (the Python FSM + adaptive ThreadPoolExecutor), never a throwaway harness. The tracker is benchmarks/experiments.db (queried via benchmarks/experiments.py); every change is ONE attributable lever, proven by the EXACT 5-gate PROMOTE predicate over N rounds before it is tagged. Never stop the PROGRAM on a regression: revert the CHANGE and record a negative. Nothing is pushed until the USER authorizes it; never restructure reviewers.wiki without human confirmation. audience: ai-agents when_to_use: > Use whenever you want to measure or improve skill-code-review's bug-finding quality (or latency) vs competitors, after changing reviewers/handlers/dispatch/prompts/corpus, or to continue the optimization loop from the generated benchmarks/STATE.md position.
scr-benchmark-optimizer
The discipline that makes skill-code-review measurably better than competing AI
code reviewers and keeps it from getting lost. A strictly gated loop: measure (via
the product) -> judge consistently -> diagnose -> change ONE lever -> re-measure N
rounds -> apply the 5-gate predicate -> promote-and-tag or revert-and-record. The
two woven goals are HIGH-QUALITY reviews (recall and F1 flat-or-up) AND LOWER latency,
with every speed change held to the SAME 5-gate predicate so speed can never trade
away recall.
Three locks govern everything below (the anti-chaos engine): one change per round,
proof-before-scale, a never-stale generated state surface. Never stop the
PROGRAM on a regression; never apply an unproven heuristic; never change the SET or
STRUCTURE of reviewers.wiki without explicit human confirmation; never push to remote
until the USER personally authorizes it.
The canonical, checkbox-tracked master plan is
docs/plans/beating-competitors.md. This skill is the followable ESSENCE of its methodology (sections 6 and 6.3); read the plan for the full architecture, schemas, calibration design, and phase checklist. When the two disagree, the plan is authoritative and this skill is updated to match.
The benchmark and what "winning" means
withmartian/code-review-benchmark (MIT): 50 bug-fix PRs across 5 repos
(Sentry/py, Grafana/go, Cal.com/ts, Discourse/rb, Keycloak/java), ~136
human-verified golden comments, committed competitor candidate sets, judged by a
3-model panel (Claude Opus 4.5, Claude Sonnet 4.5, GPT-5.2). The judge rule: a
finding is a true positive only if it matches a golden; every non-golden finding
is a false positive (a deliberately harsh precision metric). So the goal is the
precision-recall frontier: catch the goldens (recall) while emitting few non-golden
findings (precision).
The real bar is the FULL 50, not a sub-slice. Committed offline leaderboard (vendor self-published snapshots, drifting +0.02 to 0.03 per monthly refresh):
| tool | recall | precision | F1 |
|---|---|---|---|
| qodo-extended (Mar 15) | 0.61 | 0.55 | ~0.64 |
| cubic-v2 (Mar 25) | 0.69 | 0.56 | 0.62 |
| augment | 0.61 | 0.47 | 0.54 |
| macroscope | 0.44 | 0.48 | 0.46 |
| bugbot | 0.44 | 0.47 | 0.45 |
| greptile-v4-1 | 0.48 | 0.40 | 0.44 |
The current bar is ~0.64 (Qodo Extended), not 0.62; Martian's stated recall ceiling is "no tool found more than 63% of known issues". Beating "all competitors" means F1 above the ~0.62 to 0.64 band on the full 50, on the frontier (do not trade all recall for precision). Treat the offline number as COMPARATIVE, not absolute: the gold set was seeded from Augment/Greptile data and the 5 repos may be memorized, so a real bug outside the gold set scores as a false positive.
Do NOT chase the 5-PR pilot number. The pilot (one PR/repo) is a SMALL, NOISE-BOUND, cubic-favorable slice; the full 50 is the honest number. Optimize for the full set, and reach it gradually (5, 10, 15, 20 ...), never in one budget-busting run.
Current standing (the generated state surface is the source of truth)
Read "where we are" from the generated surface, never from memory:
cd skill-code-review
uv run python benchmarks/experiments.py status # one-line current position
cat benchmarks/STATE.md # the live "YOU ARE HERE"
benchmarks/STATE.md is GENERATED by experiments.py state from the DB (never
hand-edited, same discipline as the generated wiki). It shows the current PR set and
rung, the baseline tag and HEAD-clean flag, baseline F1 with CI, recall, fp/PR,
dollars/review, stability, the last proven win, open hypotheses (cheapest-proven-win
first), the dead-ends ledger, and the single next action.
Baseline@pr5 is LOCKED (tag v-bench-0.1.0-pr5): the skill-prod-primary config over
the 5 pilot PRs x 3 rounds on the claude -p backend measured F1 mean 0.593, CI
[0.537, 0.652], recall 0.727, precision 0.500, fp/PR 1.60, F1 stdev 0.030.
Read this honestly: the F1 CI STRADDLES the ~0.62 to 0.64 competitor bar, so a powered
verdict (CI clearly above the bar) needs the PR ramp (more PRs), not more 5-PR rounds.
Confirm the current numbers from STATE.md, which always reflects the latest DB state.
The tracker: benchmarks/experiments.db (NOT a markdown log)
The committed SQLite DB benchmarks/experiments.db is the durable program state. It
carries the experiments, metrics, findings, dead_ends, and timings tables.
Query and mutate it ONLY through benchmarks/experiments.py (mutating subcommands
DRY-RUN by default; pass --apply to write). Auto-generated markdown views
(HISTORY.md, LEADERBOARD.md, STATE.md) keep a git diff legible.
The subcommands you will use most:
record(--apply) - ingest one measurement (config, PR set, point metrics, fp/PR, $/review, diff-stat, n-rounds, git-sha) as one experiment row.ci(--apply) - attach bootstrap CIs to a recorded run (recall/F1 lo+hi).gate- apply the EXACT 5-gate PROMOTE predicate (baseline run vs candidate run).compare- paired side-by-side of two runs.check <hypothesis> <lever> [--pr-set ...]- the dead-end guard; run it BEFORE you start an iteration. It warns loudly if a matching dead-end exists at this rung or smaller.state/history/leaderboard(--apply) - regenerate the markdown views.status- the one-line current position.slowest [--run-id ... --measured-only]- rank the slowest stages/agents by total wall time (measured rows only by default, so the hallucinated self-reportedruntime_msnever poisons the ranking). This is the latency-workstream lens.
The full per-finding label path (build_judge_input_prod.py carries
defect_confidence/severity/idx; verdicts record per-candidate matched;
ingest_verdicts.py populates findings) and the timing telemetry
(ingest_timings.py, timings table, slowest) are wired and live; do not rebuild them.
The harness (gitignored, under skill-code-review/tmp/)
The benchmark harness lives INSIDE the skill repo at skill-code-review/tmp/
(gitignored) - it travels with the skill it benchmarks. Reviews are driven by the
product, not a throwaway orchestrator. tmp/ holds only benchmark DATA
(regenerable, safe to delete); every durable harness/driver script lives in the
tracked skill-code-review/scripts/ dir (NOT in tmp/, NOT in the code_review
package; see the scr-scripts cookbook). Every output dir is run-id + pr-shard
nested (shard = first 2 hex of sha256(pr_id)) so no directory ever accumulates
thousands of entries (filesystem bottleneck): scripts/paths.py is the single source
of the layout; all driver scripts import it. A run-id is one round (base-pr5/iter1/
...). Layout:
tmp/repos/<ab>/<pr>/- materialised repo (shared across runs);scripts/setup_repo.pywrites here at the merge-base diff (base_difffrombench-index.json; usegit merge-base(base,head), NEVERbaseRefOid..head- OBSERVATIONS #8).tmp/runs/<run-id>/<ab>/<pr>/- one review's output (its.skill-code-review/<yyyy>/ <mm>/<dd>/<ab>/<rest5>/tree +run.log+timings.json).tmp/judge/<run-id>/<ab>/<pr>.json(+_input_<pr>.json) - judge verdicts/inputs.tmp/results/<run-id>/- per-run leaderboard/metrics.bench/- benchmark clone.scripts/build_judge_input_prod.py <pr> <run-id>- extract candidates (skill-prod / skill-prod-primary / skill-prod-scoped) + competitor sets, emitting each candidate as{text, defect_confidence, severity, idx}->judge/<run-id>/<ab>/_input_<pr>.json.scripts/score.py <run-id>- aggregate per-(PR,tool) verdicts ->results/<run-id>/.scripts/rerank.py <pr> <src-run-id> <out-run-id>- fast ranker-only loop: re-run the product'srank_findingsworker on an existing review's findings (isolates a finding-ranker.md change without re-rolling specialists; ~1 call/PR). This is the N=5 cheap loop (see "lever-typed N").scripts/ingest_verdicts.py/scripts/ingest_timings.py- populate the DBfindingsandtimingstables (dry-run by default,--applywrites).
Lever-typed N (how many rounds per change class)
The 5-PR signal is noise-bound (the ranker is stochastic; F1 swings +/-0.05 to 0.10 per round, rivalling the tuning deltas), so a single round is never evidence. Average over N rounds, where N depends on the change class:
- Ranker-only (a
workers/finding-ranker.mdchange, re-run viascripts/rerank.py, ~1 call/PR, cheap): N=5. - Full run (a specialist / routing / prompt / corpus / infra change, the full product re-run): N=3 (the floor).
Each round is one run_id; a Measurement is N rounds of the SAME config on the SAME frozen
PR set, reduced to mean + bootstrap 95% CI per metric (resample unit = the PR, 10k resamples).
A Measurement is the atomic unit of evidence; an Iteration (hypothesis, one change,
measurement, gate, promote-or-revert) is the atomic unit of development.
The gated experiment loop (one iteration)
HYPOTHESIS (falsifiable; names ONE lever and the metric it should move)
-> DEAD-END GUARD: experiments.py check <hypothesis> <lever> --pr-set <rung>
(if a matching dead-end exists at this rung or smaller: warn, pick another)
-> ONE CHANGE (single attributable lever; tracker stores `git diff --stat`)
-> GREEN GATE: ruff + mypy + pytest (must pass before a benchmark dollar is spent)
-> DOGFOOD (code changes only): run the PRODUCT reviewer on the diff, address findings
-> MEASURE: N rounds on the FROZEN current PR set (same backend, same judge model)
-> reduce to mean + bootstrap 95% CI per metric (experiments.py record + ci --apply)
-> PROVE: the 5-gate predicate (experiments.py gate)
all green -> commit + tag vX.Y.Z-<pr_set_id> + POSITIVE row + regen STATE.md
G1/G2/G4/G5 red -> git restore + NEGATIVE dead_ends row (retry_at_pr_set = NULL)
only G3 straddles -> git restore + INCONCLUSIVE row (retry_at_pr_set = next rung)
The dev gate ORDER per experiment is non-negotiable: (a) green first (ruff + mypy + pytest: a broken product can not review), then (b) dogfood (run the product reviewer on your own diff and ADDRESS its findings), then (c) the 5-gate benchmark. Green, then dogfood, then benchmark.
The 5-gate PROMOTE predicate (EXACT, from plan section 6.3)
Let B = baseline Measurement, C = candidate Measurement (each N rounds). Paired bootstrap deltas over PRs (10,000 resamples). Promote iff ALL FIVE hold:
GATE-1 RECALL non-regression: mean(recall_C) >= mean(recall_B)
AND lower-95%-CI(recall_C - recall_B) >= -0.03
GATE-2 NOISE non-regression: mean(fp_per_pr_C) <= mean(fp_per_pr_B)
AND upper-95%-CI(fp_per_pr_C - fp_per_pr_B) <= +0.30
GATE-3 PROGRESS (the teeth): lower-95%-CI(F1_C - F1_B) > 0 (paired delta-F1 CI strictly above 0)
GATE-4 STABILITY: stdev(F1_C over N) <= stdev(F1_B) + 0.02
GATE-5 COST guard: mean($/review_C) <= 1.25 * mean($/review_B)
GATE-1 + GATE-2 are the long-standing both-axes no-regression gate (recall up-or-equal AND
fp-per-PR down-or-equal), now made CI-aware. GATE-3 is the new requirement: a win must be
CI-separated from the baseline, not a lucky round. GATE-4 and GATE-5 stop erratic configs and
cost blowups. Outcomes: all green -> PROMOTE; GATE-1/2/4/5 red -> REVERT (genuine regression,
retry_at_pr_set = NULL); only GATE-3 straddles zero -> INCONCLUSIVE (revert,
retry_at_pr_set = next rung, the CI may separate at more PRs). Below ~12 PRs the predicate
returns inconclusive rather than promote (underpowered); that is expected at pr5 and is
why the ramp exists.
GATE-5 IS NOW LIVE-MEASURABLE. The product runner captures per-call cost (the claude -p
--output-format json envelope carries a full usage block + a list-price total_cost_usd;
codex/cursor fall back to a dependency-free ceil(chars/4) estimate) and rolls it up into
timings.json (per-specialist est_cost + a run-level cost block). scripts/ingest_timings.py
surfaces the per-review PROXY cost (cost_mean_proxy) to hand experiments.py record --cost,
which flips GATE-5 from FAIL-CLOSED (cost_baseline_missing -> inconclusive) to a real
comparison. $/review (cost_mean) is a PROXY for the RELATIVE ratio, NEVER billed spend:
under the flat claude -p subscription nothing is billed per review; one identical estimator
prices baseline and candidate, so bias cancels in the 1.25x ratio. Anchor GATE-5 on a FRESH
instrumented baseline over the same N rounds; never mix a retroactive char-estimate baseline
with live-usage candidates. The capture is PURE INSTRUMENTATION (no prompt change, no extra
call); the tier-demote cost LEVER is the next round, separate from this measurement infra.
For a SPEED change there is an extra CHEAP-FIRST check BEFORE the full N-round measurement:
one instrumented review (telemetry on) confirming (a) the targeted stage's wall-time dropped
roughly as predicted and (b) a leaf-set sanity check (the picked-leaf set did not lose any
leaf the baseline picked, especially no sec-*/correctness leaf). Only then is the N=3
measurement spent. Latency is a first-class metric on the row (whole_review_ms,
tree_descend+llm_trim ms) but it does NOT enter the predicate, so a config can never buy a
quality regression with speed. GATE-1 is the hard floor: revert the instant recall regresses
outside its CI, no matter how much latency it saved.
Single-change discipline (the anti "too much" lock)
ONE lever per iteration (ranker OR specialist OR routing OR one threshold OR one corpus batch,
never two). A change must be attributable to one diff; the tracker stores git diff --stat and
warns (requiring an explicit --force-multi with a written reason) when more than one logical
lever changed. No "while I am here" bundling. The PR set is FROZEN during an iteration (ramp is
a separate gated step). Green before measure.
The dead-ends ledger (never re-walk a rabbit hole)
Every Measurement, pass or fail, writes a DB row. The dead_ends table keys on
hypothesis_hash = sha256(normalized hypothesis + lever) with why_failed (which gate) and
retry_at_pr_set. Before starting an iteration, experiments.py check <hypothesis> <lever> --pr-set <rung> warns loudly if a matching dead-end exists at the current or smaller PR set
(a failed hypothesis at N PRs is still a dead-end at N or fewer; it may only be retried at a
LARGER rung if it was recorded INCONCLUSIVE with retry_at_pr_set set). Mirror one-line
summaries into tmp/OBSERVATIONS.md and (optionally) the shared RAG memory via save_lesson
so other sessions inherit the dead-ends. Never re-walk a failed hypothesis at the same or
smaller PR set.
Proof-before-scale ramp (grow the PR set only after a tagged win)
Ramp from rung k to k+1 iff ALL hold: RAMP-1 at least one PROMOTED, tagged win banked at rung
k; RAMP-2 HEAD == the rung-k tag, tree clean; RAMP-3 the promoted baseline's f1_stdev <= 0.06;
RAMP-4 projected N-round cost at k+1 within the remaining budget. Ramping is NOT an iteration:
it FREEZES the new superset PR set (the set always grows, never shrinks), runs the CURRENT
promoted config N rounds to form Baseline@(k+1), tags it (-pr<k+1> suffix only), and records
a RAMP row. Rollback rule: code changes roll back on a regression, but the PR set NEVER rolls
backward (more PRs is always more honest). If a win fails to generalize at k+1 (paired comparison
on the shared subset regresses), stay at k+1, flag regression_on_ramp, and open an "overfit the
rung-k slice" hypothesis.
DOGFOOD: skill-code-review reviews its own code changes
Every CODE change to skill-code-review (the Python in code_review/, the FSM, dispatch,
handlers, runner) must be reviewed by skill-code-review ITSELF before acceptance. The dev gate
order per experiment is:
Green -
ruff check code_review/ tests/+mypy code_review/+pytestall pass. A broken product can not review, so green comes first.Dogfood - run the PRODUCT reviewer on the diff and ADDRESS its findings:
cd skill-code-review export GITHUB_TOKEN="$(gh auth token)" uv run python -m code_review.cli review --base <merge-base> --head HEAD --backend claudeTreat its
primaryfindings as you would any reviewer's: fix them or justify the dismissal in the iteration notes. A product that can not pass its own review has no business reviewing others.Benchmark - the F1 5-gate (above).
DOGFOOD targets CODE (Python). Pure docs or markdown edits are EXEMPT (this skill, the rule
files, the plan, and other .md/doc-only changes do not trigger the dogfood pass), but they still
require green (ruff + mypy + pytest, which pass trivially when no code changed) before commit.
NO PUSH: the user is the gate
Nothing is pushed to remote (no git push, no PR, no merge, no touching main) until the FINAL
number-1 AI-code-reviewer goal is reached AND the USER personally authorizes the push. All work
stays LOCAL on the feat/martian-benchmark-program branch. Commit and tag locally as the loop
prescribes (each promoted win is a local commit + a local vX.Y.Z-<pr_set_id> tag), but the
remote is untouched until the user says so. This is a HARD gate: do not push "just the docs", do
not open a "draft" PR, do not merge to main. The user is the gate.
ULTRACODE: how this program is staffed
All code-or-thinking subagent work on this program runs at ultracode intensity: parallel deep work plus adversarial verification, with nested flows allowed (a subagent may itself fan out and verify). The conductor stays ORGANIZATIONAL: it sequences iterations, enforces the gates, keeps the single-change discipline, and DELEGATES the deep work (diagnosing a missed golden, authoring a narrowed leaf, writing a critic prompt, analyzing a CI). The conductor does not hand-roll the deep analysis itself; it orchestrates ultracode workers and holds them to the 5-gate predicate.
Running a review (the product, every time)
cd skill-code-review
export GITHUB_TOKEN="$(gh auth token)" # specialists/workers read the repo
export CTXR_SCR_CALL_TIMEOUT=600 # per-agent-call ceiling (env-tunable)
uv run python -m code_review.cli review \
--repo tmp/repos/<ab>/<pr> --base <base_diff> --head <head> \
--run-dir tmp/runs/<run-id>/<ab>/<pr> --backend claude --max-workers 8 --clean
# (compute <ab>/<pr> paths via scripts/paths.py: repo_dir(pr) / run_dir(run_id, pr))
--backend is agent-agnostic (claude | codex | cursor | anthropic | openai; a
pydantic-ai backend covering Anthropic/OpenAI/Gemini with billed usage is planned, see the plan
section 5). --clean wipes the run dir's .skill-code-review for a fresh, cache-free run. The
runner drives the FSM in-process with an adaptive thread pool; worker + specialist calls are
fault-tolerant (retry/backoff, graceful degradation).
The loop (one iteration, in full)
- Read STATE.md (
experiments.py status) for the current rung, baseline, and next action. - Pick a falsifiable hypothesis naming ONE lever and the metric it should move; run the
dead-end guard (
experiments.py check). - Change the ONE lever. Edit the right layer (see Levers). Keep the diff attributable.
- Green gate.
ruff+mypy+pytest. Then DOGFOOD if the change touched code. - Measure the product CLI over the FROZEN PR set, N rounds (ranker-only N=5 via
rerank.py; full N=3). 100% diff coverage: every changed file is sharded into specialist units. - Judge once, consistently.
build_judge_input_prod.pyper PR, then judge (the panel rule) intojudge/<run-id>/<ab>/<pr>.json. Reuse committed competitor verdicts for apples-to-apples. - Record + CI.
experiments.py record --applythenci --apply; reduce to mean + bootstrap CI. - Diagnose. Per PR, which goldens were MISSED (recall loss -> specialist depth or leaf coverage)
and which non-goldens were marked
primary(precision loss -> ranker). Distinguish "real bug outside the golden set" (benchmark incompleteness) from "noise" (genuine over-reporting). - Gate.
experiments.py gate. All green -> commit + tag + regen STATE.md. Else REVERT the CHANGE and record the negative/inconclusive row. Never stop the PROGRAM; revert the CHANGE. - Ramp only after a tagged win (proof-before-scale). Escalate a COVERAGE gap to a human-gated
reviewers.wikichange (see "Coverage" lever and the wiki-build protocol).
Where to optimize (levers, by symptom)
- Precision (too many non-golden primaries):
workers/finding-ranker.md. Makeprimarythe block-this-PR set; demote real-but-secondary findings (defensive hardening, load/cost-only perf, no-test/magic-number) to advisory. Keep CONCRETE correctness/security bugs primary even when edge-case. The ranker emits compact per-index decisions; the runner re-attaches scores (dispatch._apply_rank_decisions). This is the cheapest lever (N=5 viarerank.py). - Recall (a golden no specialist surfaced):
workers/specialist.md. Add GENERAL bug-hunting heuristics (unset/missing-state null-deref; external-tool argument format/units; shadowed/duplicate definitions; cache-recursing-through-self). Emit ONE finding per distinct root cause - never bundle a None-deref with a KeyError. - Consensus / confidence (always-on multi-model): the planned
critique_findings+reconcile_findingsstages (plan section 5.3) inject ALL critic verdicts into reconciliation, reasoning-first. Multi-model is core and always-on by design; the PR count (not disabling consensus) is the cost throttle while developing. - Routing (right leaves not picked):
workers/tree-descender.md/trim-candidates.md(metadata-only, no file reads) - and leaffocus/activationin the wiki (thescr-reviewers-wiki-authoringskill). Beware broad**/*globs: they over-activate and bias routing toward generic leaves (and they are the #1 latency cost; see the speed levers). - Latency (routing dominates the budget): routing (
tree_descend+llm_trim) costs ~39% of wall-time and buys NO recall. S1 merges the two routing passes into one; S2 cuts the 107 broad**/*.{lang}globs to real signals (a human-gated corpus change). Both change the picked-leaf set, so both are F1-5-gated like any quality lever, with the cheap-first leaf-set sanity check first. Embedding routing was tried and REJECTED (OBSERVATIONS #44); see the plan section 5A.3. - Coverage (no leaf exists for a bug class): a reviewers.wiki change - human-gated; see below.
Wiki regeneration: the no-regression gate (HARD) and the change protocol
Any regeneration of reviewers.wiki (a corpus or structure change, the broad-activation cut, and
the full layout-driven rebuild) must clear the no-regression gate BEFORE it is promoted, on TOP of
the normal 5-gate flow:
- Re-run the PRODUCT reviewer on the SAME five pilot codebases used throughout:
cal.com-14943,discourse-1,grafana-80329,keycloak-32918,sentry-67876. - Compare against the recorded baseline on BOTH axes and require no regression on either: recall up-or-equal AND false-positives-per-PR down-or-equal. Better on one axis must not come at the cost of the other (this is GATE-1 + GATE-2 expressed at the corpus level).
- If either axis regresses, do NOT promote the regenerated wiki. Revert or iterate.
A change to the SET or STRUCTURE of reviewers (and any activation/glob narrowing, and any full
corpus regeneration) is additionally HUMAN-GATED: a written, statistically-justified proposal plus
explicit sign-off. It must also carry all FIVE parts of the wiki-build methodology change protocol
(plan section 6.9), recorded IN THE REPO (the plan and/or the authoring skill and the experiment
row), not just in chat: (1) the METHOD (which leaves, before/after activation, the deterministic
build/validate/promote flow); (2) the activation SIGNALS taxonomy per leaf (file_globs,
keyword_matches, structural_signals, escalation_from) and which signals replaced the broad
glob; (3) the GAPS (what the narrowed activation can no longer catch); (4) the SENSITIVITY analysis
(the recall risk of each narrowing and its mitigation - the half people skip, REQUIRED because recall
is sacred); (5) the REPRODUCTION steps (the authoring flow, scripts/check_wiki_drift.py, and the
five-PR no-regression benchmark). A wiki/corpus change missing any of the five is INCOMPLETE and must
not be promoted. Author in reviewers.src/, regenerate via skill-llm-wiki, validate, byte-verify
the drift, promote; never hand-edit the generated wiki. Follow scr-reviewers-wiki-authoring.
Architecture contract (baked into the product)
- 100% coverage, always. The dispatch loop shards EVERY changed file into units - 10 files or
1000, review them all; a failed unit is a
status: failedrow, not a lost file. - Regulated parallelism.
runner.pydispatches specialists through an adaptiveThreadPoolExecutor(AIMD: halve workers on rate-limit, +1 on success, bounded). - Deterministic, big-data-safe collection.
collect_findingsis pure Python; per-leaf findings are persisted and merged algorithmically - never one giant agent context. - Two-stage dedup. Deterministic location + embedding/token clustering, then a neutral ranker/deduper agent adjudicates residual duplicates (compact decisions only). Never silently merge distinct bugs.
- Never overflow context. Worker inputs are compacted (heavy leaf fields stripped for the prompt, rehydrated by id afterward); overflow sub-shards and retries.
- Fault tolerance. Worker AND specialist calls retry rate-limit/overflow with backoff; an
empty/unparseable agent reply is retryable; best-effort stages (e.g.
tool_discovery) DEGRADE instead of faulting; null optional fields are stripped before schema validation. A flaky routing worker can never zero a review (the deterministic coverage floor) - but the floor fires only when routing returns EMPTY, NOT per-leaf, so a pre-filter that silently drops asec-*leaf is NOT rescued by it. - Sharper specialists, neutral ranker. Specialists favor recall (report every plausible defect
with a
confidence); the ranker is the precision gate (primaryselection). Selectivity, not suppression - demote, don't drop. - Agent-agnostic + prompt-externalised. All prompts live in
workers/*.md(never hardcoded in Python); any backend (claude/codex/cursor/api) works. - Wiki structure is human-gated. Mechanical FSM/dispatch/prompt changes are free. Any change to
the SET or STRUCTURE of
reviewers.wikineeds a written, statistically-justified proposal and human confirmation - seescr-reviewers-wiki-authoringand the protocol above. - Tracked, gated experiments. Every round is recorded in
benchmarks/experiments.db; every change is proven by the 5-gate predicate before it is tagged;STATE.mdis regenerated each promote;tmp/OBSERVATIONS.mdcarries the status board.
Hard-won lessons (do not regress these)
- Drive reviews ONLY through the product (
cli.py/runner.py). Never build a parallel review orchestrator intmp/. tmp holds DATA + MEASUREMENT only. - Keep all prompts in
workers/*.md. Read them; never inline prompt text in Python. - Worker inputs must not be truncated. Truncating
activated_leavesat a char cap cut the array mid-list and silently dropped the alphabetically-late (lang-/sec-/footgun-/crypto-) leaves -> only generic antipatterns routed. Compact per-leaf (dropcovers/audit_surface), never truncate the SET. - Routing workers decide from metadata, not files. An agentic wiki-file-reading tree-descender/tool-runner is 6 to 8 min and times out; reading the brief metadata is ~1 min. The routing cost is INPUT SIZE x DOUBLE PASS (two sequential ~150K-token sonnet passes), not the model tier (routing already runs on sonnet, the cheap tier).
- The ranker is stochastic. At 5 PRs the noise rivals the deltas - average N rounds (ranker-only
N=5 via
rerank.py, full N=3) and prefer the full 50 for a stable signal. Don't add per-golden rules to chase the pilot: that is over-fitting. - Many "false positives" are real bugs outside the golden set. Audit before suppressing; don't make the product worse to game an incomplete golden set. The gold set was seeded from Augment/Greptile, so treat the offline number as comparative, not absolute.
- Embedding/dense-retrieval ROUTING was tried and REJECTED (OBSERVATIONS #44): a raw code patch is
out-of-distribution as a retrieval query and missed every
sec-*leaf pluslang-ruby/fw-rails. Any future retrieval pre-filter needs a net-new always-include guardrail (the coverage floor does not protect per-leaf) and a non-raw-patch query; see plan section 5A.3. - Never stop the PROGRAM on a regression; revert the CHANGE and record a negative. A failing iteration is not a failed program - it is one attributable dead-end recorded so no session re-walks it. The program advances by banking proven wins, not by avoiding failed hypotheses.
Lesson: a missed golden is often a missing leaf, not just noise
When a no-regression run shows a recall dip, do NOT just tune the ranker (noise). Investigate each MISSED golden and ask "does any leaf own this concern?" A golden the corpus misses is often a COVERAGE GAP (a missing reviewer for a common bug class), not a routing/noise problem. In one cycle the deterministic-wiki cutover dipped recall by one golden; investigation found NO dedicated leaf for three common classes, so three GENERALIZED footgun leaves were added (footgun-null-and-missing-state CWE-476, footgun-unintended-recursion CWE-674, footgun-destructive-query-scope CWE-1284) plus a sharpen to footgun-toctou-race for non-atomic counters. Result: recall recovered to no-worse-than-baseline AND noise more than halved (3.2 -> 1.4 fp/PR), F1 0.46 -> 0.62. Encode generalized reviewers (a whole bug class), never per-golden rules (over-fitting).
Recipe when a golden is missed:
- Re-judge the run consistently (the panel rule) so the miss is real.
- List the MISSED goldens per PR.
- For each, check the corpus for an OWNING leaf (does any reviewer's focus/covers/activation claim this concern?).
- For a genuine gap, add a GENERALIZED leaf for the bug class via
scr-reviewers-wiki-authoring(human-gated SET change, with all five protocol parts); never a per-golden rule. - Re-benchmark through the no-regression gate (recall up-or-equal AND fp/PR down-or-equal) plus the 5-gate predicate; keep only if it holds or improves the frontier, else revert the CHANGE.
Before any commit to skill-code-review
uv run ruff check code_review/ tests/
uv run mypy code_review/
uv run pytest
All three must pass. A CODE change additionally requires the DOGFOOD pass (run the product reviewer on
the diff and address its findings). A reviewers.wiki change additionally requires the human-confirmed
proposal, all five parts of the change protocol, the scr-reviewers-wiki-authoring build/validate flow,
and the five-PR no-regression benchmark. And remember the NO-PUSH gate: commit and tag LOCALLY on
feat/martian-benchmark-program; never push, open a PR, or merge until the USER authorizes it.