name: multi-agent-debate-and-execution
description: Complete lifecycle skill for running long-horizon multi-agent engineering work (adversarial debate that produces verdicts, critic-gated batch execution that ships work in small reviewable units, and closure-phase multi-angle review before main-branch merge). Auto-loads when task involves "adversarial debate", "critic-gate", "longlast teammate", "packet execution", "batch dispatch", "multi-batch", "verdict drift", "PR closure review", "stage-gated revert", "honest disclosure", "cite-CONTENT discipline", "MEDIUM-risk surface pre-flag", "PATH A precision-favored", "sibling-coherence", "carry-forward LOW", "5th outcome category", "anti-rubber-stamp template", "co-tenant safe staging", "stash-and-patch", "plumbing-merge", "cross-module orchestration seam", or when work spans 30+ hours / 5+ batches / 2+ rotating sessions. Generic to any project; case-study Zeus R3 §1 #2 (5 packets / 32 critic cycles / 100% anti-rubber-stamp / 1 earned REVISE / 0 BLOCK).
Multi-Agent Adversarial Debate + Critic-Gated Execution Lifecycle
A complete reusable workflow for long-horizon engineering work that needs high-trust verification across rotating sessions.
§0 Scope — when this skill applies
Use this skill when:
- Engineering work will span 30+ hours / 5+ batches / multiple rotating sessions
- High-risk surfaces (live modules, schemas, calibration stores, retrain pipelines) are touched
- Work requires verifiable trust across session boundaries
- The cost of a silent regression > the cost of explicit review gates
- A team of multiple agents must coordinate without a single agent owning end-to-end context
Do NOT use this skill for:
- One-shot small fixes (a single commit, no batch decomposition)
- Pure exploration with no implementation deliverable
- Tasks where one agent can hold full context end-to-end
§1 Lifecycle overview — three phases
PHASE A: DEBATE               PHASE B: EXECUTION            PHASE C: CLOSURE
────────────────              ────────────────────          ──────────────────
R1: Pro / Con / cross-exam    Per-packet boot + 3-batch     Multi-angle review
R2: Alt-system / synthesis    GO → DONE → REVIEW cycle      (architect/critic/
R3: Capital allocation /      Critic-gate per batch         explore/scientist/
    sequencing                K3-surface pre-flag           verifier)
                              Carry-forward LOW
Output: verdict.md            Output: shipped commits       Output: PR-ready
                              + critic evidence trail       with full review
The DEBATE phase is fully covered by the existing methodology doc at docs/methodology/adversarial_debate_for_project_evaluation.md (§0-§8). This skill focuses on EXECUTION + CLOSURE, which were previously implicit in §5 of that doc.
§2 Pre-execution setup
§2.1 Spawn longlast teammates
You need at minimum two persistent teammates (team skill or equivalent):
- executor — implements batches; reads context; surfaces KEY OPEN QUESTIONS in boot evidence; commits with critic-gate APPROVE
- critic — independently reviews each batch; runs 10-ATTACK probes; produces APPROVE / APPROVE-WITH-CAVEATS / REVISE / BLOCK verdicts
Optionally:
- document-specialist — for SDK/API documentation lookups during execution
- explore — for codebase searches that would clutter executor's context
§2.2 Idle-only bootstrap
Spawn each teammate with an idle-only bootstrap prompt:
You are <name> in team <team-name>. Judge: team-lead.
ROLE: <one sentence>
This is BOOT-ONLY. Do NOT engage substantively. Do NOT take action.
Read these files:
- <root authority doc>
- <relevant invariants>
- <recent verdict / dispatch>
Then write to <evidence-path>/_boot_<name>.md (≤300L) with:
- §0 read summary
- §1 expected challenges
- §2 known limitations
- §3 idle commitment
SendMessage team-lead: "BOOT_ACK_<NAME> path=<abs path>". Then idle.
DO NOT engage. Wait for explicit dispatch.
This avoids the failure mode where teammates start work before the team-lead has reviewed boot evidence and confirmed scope.
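As a rough illustration of how a team-lead might gate on the boot acknowledgement, the sketch below parses the ack message and checks the evidence file on disk. The wire format and the 300-line limit come from the prompt above; the function name `check_boot_ack`, the regex, and the POSIX-style absolute path are assumptions made for this example.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed wire format, taken from the bootstrap prompt: "BOOT_ACK_<NAME> path=<abs path>"
BOOT_ACK_RE = re.compile(r"^BOOT_ACK_(?P<name>[A-Za-z0-9_]+) path=(?P<path>/\S+)$")
MAX_BOOT_LINES = 300  # the "≤300L" limit from the prompt


def check_boot_ack(message: str) -> Optional[str]:
    """Return None if the ack is acceptable, else a human-readable reason."""
    match = BOOT_ACK_RE.match(message.strip())
    if match is None:
        return "message is not a well-formed BOOT_ACK"
    evidence = Path(match.group("path"))
    if not evidence.is_file():
        return f"evidence file missing on disk: {evidence}"
    line_count = len(evidence.read_text(encoding="utf-8").splitlines())
    if line_count > MAX_BOOT_LINES:
        return f"boot evidence is {line_count} lines (limit {MAX_BOOT_LINES})"
    return None
```

A `None` result only says the ack is well-formed and backed by a file; the team-lead still reads the boot evidence before dispatching.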
§2.3 Team config + disk-first protocol
- Use ~/.Codex/teams/<team-name>/config.json (or platform equivalent)
- All inter-agent state lives on disk first; SendMessage is delivery only
- Teammates write evidence files BEFORE sending status messages
- SendMessage drops have been observed empirically, so disk is canonical
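The disk-first rule can be sketched as a small helper that writes the evidence file atomically and only then sends a pointer to it. This is a minimal sketch, not the team tooling itself: `publish_evidence` and the `send` callable (a stand-in for SendMessage) are hypothetical names.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def publish_evidence(path: Path, body: str, send: Callable[[str], None], tag: str) -> None:
    """Write evidence to disk atomically, then send a message pointing at it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so the final rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(body)
    os.replace(tmp_name, path)
    # Delivery happens last: if this message is dropped, the file is still there.
    send(f"{tag} path={path.resolve()}")
```

Because the message carries only a tag and a path, a dropped message loses nothing that polling the evidence directory cannot recover.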
§2.4 Critic gating discipline
The critic MUST:
- Be a separate persistent agent (not an Agent-tool-spawned subagent, which dies on session restart)
- Have its own SKILL bootstrap loaded
- Run independently of executor's claims (re-runs tests, re-verifies cites, re-greps surfaces)
- Use the 10-ATTACK template (§9 below)
- Never write "narrow scope self-validating" or "pattern proven without test" — these are rubber-stamp tells
§3 Per-packet boot template
A "packet" is a coherent body of work that decomposes into 2-5 batches. Each packet starts with a BOOT phase.
§3.1 Team-lead dispatches DISPATCH_<PACKET>
DISPATCH_<PACKET_NAME> (one-line dispatch)
Authority basis: <verdict path + section anchor>
PACKET SCOPE per <plan source §X>:
"<verbatim quote of what the packet covers>"
Risk rating: LOW / MEDIUM / HIGH / HIGHEST per <criterion>
DEFAULT FRAMING (open to your boot challenge):
PATH A measurement-only first (precision-favored; see §6)
PROVISIONAL N-BATCH DECOMPOSITION:
- BATCH 1 (~Xh): <description>
- BATCH 2 (~Yh): <description>
- BATCH 3 (~Zh): <description>
NOT-IN-SCOPE (will reject expansion):
- <list of files/surfaces NOT to touch>
- <list of explicitly deferred items>
CARRY-FORWARD from prior packet (fold into BATCH N):
- LOW-X: <description>
BOOT-ONLY DISPATCH:
- Read <list of context files>
- Verify K1 read-only contract for <surface>
- Write boot evidence to <abs path> (~150-300L) with §0/§1/§2/§3/§4/§5/§6 structure
- SendMessage `BOOT_ACK_<EXECUTOR>_<PACKET> path=<abs>`. Then idle.
DO NOT execute BATCH 1 until you receive explicit GO_BATCH_1_<PACKET>.
§3.2 Executor writes boot evidence (§0-§6 standard structure)
# <PACKET> packet — executor boot
Created: YYYY-MM-DD
Author: <executor>@<team>
Source dispatch: DISPATCH_<PACKET>
Plan-evidence basis: <verdict>
## §0 Read summary
| Source | What I learned |
|---|---|
| <file:line> | <one-line takeaway> |
## §1 KEY OPEN QUESTIONS (the load-bearing findings)
### KEY OPEN QUESTION #1 — <structural reality finding>
**Dispatch said:** "<verbatim quote>"
**Reality at HEAD:** <what I actually found>
**Implication:** <PATH A/B/C choice>
## §2 Per-batch design sketch
### BATCH 1 — <function-or-module> + tests (~Xh)
**Files**: <list>
**Function signature**: <code>
**Tests** (~N tests): <numbered list>
## §3 Risk assessment per batch
| Batch | Risk | Mitigation |
## §4 Discipline pledges
- ARCH_PLAN_EVIDENCE = <path>
- file:line cites grep-verified within 10 min before commit
- LOW-CAVEAT-XX-N-M lessons applied
- Co-tenant safe staging
- NO commits without critic-gate APPROVE
## §5 Out-of-scope
- <NOT-IN-SCOPE items reaffirmed>
## §6 Open clarifications for team-lead (defaults if no specific guidance)
1. **<question>**: option A / B / C. **Default: <recommendation>**
§3.3 The KEY OPEN QUESTION pattern is load-bearing
Boot evidence must ALWAYS surface structural mismatches between the dispatch's intended axis and HEAD's actual surface. In the case study this caught a mismatch in 4 of 5 packet boots, for example:
- WS_POLL: dispatch said "ws_share/poll_share" → HEAD has no update_source → PATH A
- CALIBRATION: dispatch said "(city, target_date, strategy_key)" → HEAD persists per-bucket → PATH A
- LEARNING_LOOP: dispatch said "no append-only history (per prior packet)" → HEAD HAS calibration_params_versions → HONEST DISCLOSURE
Skipping this step means the misread surfaces later as a critic finding (or worse, ships unnoticed).
§4 Per-batch GO / DONE / REVIEW cycle
§4.1 Team-lead dispatches GO_BATCH_X with §6 resolutions
GO_BATCH_X_<PACKET> — boot evidence APPROVED.
§6 clarification resolutions (all <N> ACCEPT-DEFAULT):
1. **<question>**: <decision>
...
EXECUTION ORDER (strict):
- BATCH X → critic-gate review → land if APPROVE
- File:line + content cites grep-verified within 10 min before commit (cite-CONTENT discipline)
- NO `git add -A` (co-tenant safety)
- Update BASELINE_PASSED in pre-commit-invariant-test.sh
- Per-batch critic dispatch is ON ME after each BATCH_X_DONE
Cross-batch reminder: HARD NOT-IN-SCOPE for <surfaces> is the WRITER side.
Idle awaiting BATCH_X_DONE_<PACKET>.
§4.2 Executor implements + sends BATCH_X_DONE
Standard executor message format:
BATCH_X_DONE_<PACKET> files=<file1> + <file2> + ... tests=<N> passed <M> skipped vs UPDATED baseline <N>/<M>/<F> → EXACT MATCH baseline=preserved planning_lock=<receipt or N/A>
Commit: <SHA> "<commit subject>"
<paragraph: design summary, key tradeoffs, sibling-coherence cites>
CRITIC PRE-FLAG (per GO_BATCH_X instruction): <surface> is <RISK> per <AGENTS.md cite>. The N read additions are <pure SELECT / no schema mutation / no impact on writers>. Critic-harness <Nth> cycle should verify: (a) ..., (b) ..., (c) ...
<N> tests:
- <category>: <test names>
Carry-forward lessons honored:
- LOW-X-Y honored via <how>
NOT-TOUCHED per dispatch §NOT-IN-SCOPE: <list>
Idle for critic-harness <Nth> cycle review.
§4.3 Team-lead independent verification + critic dispatch
Before forwarding to critic, team-lead independently:
- Re-runs the cited test count (pytest the specific files)
- Verifies BASELINE_PASSED arithmetic
- Greps for K1 violations (INSERT/UPDATE/DELETE in NEW lines only; git diff PREV..NEW, filtered)
- Confirms file count matches commit (git show --stat)
Then dispatches critic with structured 10-ATTACK probe list (§9).
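The "NEW lines only" filter in the K1 grep above could look roughly like the following. It is a sketch under assumptions: the input is the text of a unified diff (for example the output of git diff PREV..NEW), and the helper names `added_lines` and `k1_violations` are invented for illustration.

```python
import re
from typing import List

# SQL verbs that would violate a read-only (K1) contract.
WRITE_VERB_RE = re.compile(r"\b(INSERT|UPDATE|DELETE)\b", re.IGNORECASE)


def added_lines(unified_diff: str) -> List[str]:
    """Return only the lines a diff adds, without the leading '+'."""
    return [
        line[1:]
        for line in unified_diff.splitlines()
        # "+++ b/file" is a file header, not an added line.
        if line.startswith("+") and not line.startswith("+++")
    ]


def k1_violations(unified_diff: str) -> List[str]:
    """Added lines that mention a write verb; an empty list means the probe passes."""
    return [line for line in added_lines(unified_diff) if WRITE_VERB_RE.search(line)]
```

A hit is a candidate for manual review rather than proof of a violation, since the verbs can also appear in comments or docstrings.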
§4.4 Critic returns BATCH_X_REVIEW_DONE
Standard critic verdict format:
BATCH_X_REVIEW_DONE_<PACKET> <VERDICT> path=<critic evidence file>
<Nth> critic cycle. <VERDICT> (<count> LOW <count> MEDIUM <count> BLOCK <count> REVISE).
<paragraph: which probes PASSED / which surfaced concerns>
Verification (N ATTACK probes all PASS):
- <bullet list>
Cycle-prior LOWs RESOLVED:
- LOW-X-Y-Z: <how it was resolved>
NEW LOWs track-forward:
- LOW-X-Y-Z (severity): <description> + <suggested fix path or DEFERRED>
AUTHORIZE push of <SHA> → <PACKET> BATCH X LOCKED. Ready for GO_BATCH_(X+1).
Cycle metrics: N cycles, A clean APPROVE, B APPROVE-WITH-CAVEATS, C REVISE, D BLOCK.
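The closing "Cycle metrics" line is simple bookkeeping, and keeping it mechanical avoids arithmetic slips. A minimal sketch, assuming the four verdict labels from §2.1 are recorded as plain strings (the function name `cycle_metrics` is hypothetical):

```python
from collections import Counter
from typing import Iterable

VERDICTS = ("APPROVE", "APPROVE-WITH-CAVEATS", "REVISE", "BLOCK")


def cycle_metrics(verdicts: Iterable[str]) -> str:
    """Render the 'Cycle metrics' line from a sequence of per-cycle verdicts."""
    counts = Counter(verdicts)
    unknown = set(counts) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    total = sum(counts.values())
    return (
        f"Cycle metrics: {total} cycles, {counts['APPROVE']} clean APPROVE, "
        f"{counts['APPROVE-WITH-CAVEATS']} APPROVE-WITH-CAVEATS, "
        f"{counts['REVISE']} REVISE, {counts['BLOCK']} BLOCK."
    )
```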
§4.5 Team-lead pushes (if APPROVE) and dispatches next batch
After APPROVE-or-better:
- git push origin <branch> (FF expected)
- Dispatch GO_BATCH_(X+1) with carry-forward LOWs from this cycle folded in
- Update task tracker
If REVISE:
- Forward critic's defects list to executor
- Wait for executor's BATCH_X_REVISE commit
- Re-dispatch critic for follow-up review
§5 3-batch decomposition pattern
For measurement/observation packets, the recurring decomposition is:
BATCH 1 — Pure-data projection
- Read-only K1-compliant surface
- Returns dict[bucket_key, snapshot_dict]
- Sample-quality boundaries (e.g., 10 / 30 / 100 thresholds)
- ~6-10h, ~9-15 tests
- Mesh registration: source_rationale.yaml + test_topology.yaml
BATCH 2 — Detector
- Pure-Python over BATCH 1 outputs
- Ratio test or KL-divergence (sibling defaults: 1.5x warn / 2.0x critical)
- Verdict dataclass: kind Literal + severity Optional + evidence dict
- insufficient_data graceful (trailing_std<=0 or n_windows<min)
- ~4-6h, ~6-7 tests
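One possible shape for a BATCH 2 detector is sketched below. The verdict fields, the 1.5x / 2.0x defaults, and the insufficient_data conditions come from the list above. The ratio definition (distance of the current value from the trailing mean, in units of trailing standard deviation), the `min_windows=4` default, and the name `detect_drift` are assumptions for illustration only.

```python
import statistics
from dataclasses import dataclass, field
from typing import Any, Dict, Literal, Optional, Sequence

WARN_RATIO = 1.5      # sibling default from the text
CRITICAL_RATIO = 2.0  # sibling default from the text


@dataclass(frozen=True)
class Verdict:
    kind: Literal["within_normal", "drift", "insufficient_data"]
    severity: Optional[Literal["warn", "critical"]] = None
    evidence: Dict[str, Any] = field(default_factory=dict)


def detect_drift(current: float, trailing: Sequence[float], *, min_windows: int = 4) -> Verdict:
    """Ratio test of the current window against trailing windows (assumed metric)."""
    n_windows = len(trailing)
    if n_windows < min_windows:
        return Verdict("insufficient_data", evidence={"n_windows": n_windows})
    trailing_std = statistics.pstdev(trailing)
    if trailing_std <= 0:
        return Verdict("insufficient_data", evidence={"trailing_std": trailing_std})
    ratio = abs(current - statistics.fmean(trailing)) / trailing_std
    evidence = {"ratio": ratio, "n_windows": n_windows, "trailing_std": trailing_std}
    # Strict comparisons: a ratio exactly at a threshold stays on the lower side.
    if ratio > CRITICAL_RATIO:
        return Verdict("drift", "critical", evidence)
    if ratio > WARN_RATIO:
        return Verdict("drift", "warn", evidence)
    return Verdict("within_normal", evidence=evidence)
```

The function takes plain values and returns a plain dataclass, so it can be tested without a database and composed by the BATCH 3 runner.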
BATCH 3 — Weekly runner + AGENTS.md + e2e tests
- CLI script: --end-date / --window-days / --critical-cutoff / --override-bucket KEY=VALUE / --db-path / --report-out / --stdout
- JSON output: report_kind=<name>_weekly, report_version=1, ...
- Exit 0 if no detection; exit 1 if any (cron-friendly)
- Sibling-symmetric with prior weekly runners (script_manifest.yaml entry field-by-field same shape)
- AGENTS.md sections: Scope / Output schema / Threshold defaults TABLE / KNOWN-LIMITATIONS / Severity tier rationale / Operator runbook
- ~3-5h, ~5-7 e2e tests
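A skeleton for the BATCH 3 runner might look like the following. The flag names, the report_kind / report_version fields, and the exit-code convention come from the list above. The default values, the float-valued overrides, the report_kind value "example_weekly", and the `detections` parameter (a placeholder for real detector output) are assumptions.

```python
import argparse
import json
from typing import Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Weekly detector runner (sketch)")
    parser.add_argument("--end-date", help="last day of the window, YYYY-MM-DD")
    parser.add_argument("--window-days", type=int, default=7)
    parser.add_argument("--critical-cutoff", type=float, default=2.0)
    parser.add_argument("--override-bucket", action="append", default=[], metavar="KEY=VALUE")
    parser.add_argument("--db-path")
    parser.add_argument("--report-out")
    parser.add_argument("--stdout", action="store_true")
    return parser


def parse_overrides(pairs: List[str]) -> Dict[str, float]:
    """Turn repeated KEY=VALUE flags into a dict (values assumed numeric)."""
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        overrides[key] = float(value)
    return overrides


def main(argv: Optional[List[str]] = None, detections: Optional[List[dict]] = None) -> int:
    args = build_parser().parse_args(argv)
    report = {
        "report_kind": "example_weekly",
        "report_version": 1,
        "window_days": args.window_days,
        "overrides": parse_overrides(args.override_bucket),
        "detections": detections or [],  # placeholder for real detector output
    }
    text = json.dumps(report, indent=2, sort_keys=True)
    if args.report_out:
        with open(args.report_out, "w", encoding="utf-8") as handle:
            handle.write(text)
    if args.stdout:
        print(text)
    # Cron-friendly: a non-zero exit signals that at least one detection fired.
    return 1 if report["detections"] else 0
```

A script entry point would end with `sys.exit(main())` so that cron sees the exit code.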
For other packet shapes (action / mutation), this decomposition may not fit — adapt or use a different decomposition. The pattern works for OBSERVABILITY packets specifically.
§6 Risk framing patterns
§6.1 PATH A / B / C decision tree
When dispatch's intended axis doesn't match HEAD's substrate:
| Path | Framing | Use when |
|---|---|---|
| PATH A (precision-favored) | Drop the unsupported axis from contract; measure only what's supported | Default. Honest. Documented limitation. Mirrors prior 4-of-5 packets in case study. |
| PATH B (recall-favored heuristic) | Use a heuristic classifier or proxy join | Only when operator explicitly authorizes; risks invented-data critique. |
| PATH C (writer extension) | Modify the upstream writer to add the missing axis | Out-of-scope by default; requires explicit operator dispatch as separate packet. |
§6.2 Surface classification (K0 / K1 / K2 / K3)
Define for your project (or import from existing AGENTS.md):
- K0 — frozen kernel (DB schema, contracts); never mutate without schema migration packet
- K1 — read-only projections; pure SELECT; aggregation in memory; safe extension surface
- K2 — derived/auxiliary state; controlled mutation OK with critic-gate
- K3 — live execution path; touching writer side requires HIGH-risk gate + operator dispatch
§6.3 K3-adjacent surface pre-flag pattern
When a packet needs to add a read-only function to a K3 module (e.g., list_active_X(conn) -> list[dict]):
- Add ONLY pure-SELECT (zero INSERT/UPDATE/DELETE)
- Pre-flag in commit message + dispatch + boot evidence
- Critic dispatch includes explicit "attack hardest here" instruction with specific verification (read filter exactly mirrors existing reader; no cross-coupling; pre-table-missing graceful)
- Sibling-coherent with prior K3 read additions
Used 3× in case study (CALIBRATION store.py + LEARNING retrain_trigger.py); each verified clean by critic.
§6.4 HIGHEST-risk surface boot-then-confirm
When packet inherently touches HIGH-risk modules (writers, retrain triggers, calibration store mutators):
- Dispatch BOOT-ONLY first (no GO_BATCH_1 in same message)
- Executor surfaces full risk surface map in §3 of boot evidence
- Team-lead reviews with explicit veto power on each touched surface
- Operator can intervene at boot-evidence stage before any code changes
§7 Discipline patterns (load-bearing antibodies)
§7.1 Carry-forward LOW pattern
Each critic cycle produces 0-N LOW caveats. They aren't blocking but they accumulate. Pattern:
- Cycle N produces LOW-X-N-M (e.g., LOW-CITATION-CALIBRATION-1-1 from cycle 27)
- Cycle N+1 dispatch explicitly folds LOW-X-N-M into BATCH (N+1) carry-forward instructions
- Cycle N+1 executor addresses it; cycle N+1 critic verifies the fix
- LOWs not addressed by next cycle become "tracked forward" — eligible for follow-up commits
Empirically: 24 of 32 cycles produced APPROVE-WITH-CAVEATS; ALL LOWs resolved by packet close.
§7.2 Cite-CONTENT discipline (cycle-29 sustained)
Beyond grep-verifying file:line citations, also verify the cited CONTENT actually says what the citation claims. Cycle 29 caught a src/calibration/AGENTS.md L14-22 alpha-decay rationale cite where L14-22 was a danger-level table, not the alpha-decay rationale.
Empirical dividend: cycle 30 immediately caught a substrate misread (claim "no append-only history" → reality calibration_params_versions exists in retrain_trigger.py). The discipline note IS an antibody — Fitz Constraint #3 immune system pattern.
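The difference between checking that a cited line range exists and checking that it says what the cite claims can be sketched as follows. The helper names and the substring match on an expected phrase are assumptions; a human or critic still has to choose the phrase that characterises the claim.

```python
from pathlib import Path
from typing import List


def cited_lines(path: Path, start: int, end: int) -> List[str]:
    """Return lines start..end (1-indexed, inclusive) of the cited file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return lines[start - 1:end]


def cite_matches_content(path: Path, start: int, end: int, expected_phrase: str) -> bool:
    """True only if the cited range exists and actually mentions the claimed phrase."""
    span = cited_lines(path, start, end)
    if len(span) != end - start + 1:
        return False  # the range runs past the end of the file: the cite has rotted
    return expected_phrase.lower() in "\n".join(span).lower()
```

In the cycle-29 situation described above, the range check alone would pass (the lines exist), while the content check would fail because the cited lines hold a different table.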
§7.3 HONEST DISCLOSURE pattern
When you discover a prior packet/cycle made a substrate-misread claim:
- Surface in current packet's module docstring
- Add cross-link correction in prior packet's AGENTS.md (§CORRECTION subsection)
- Cite the cycle-N discipline lesson that caught it
- Acknowledge without dramatic framing (executor stays calm; critic verifies independently)
This converts a near-miss into a methodology dividend — the operator sees the discipline producing measurable value.
§7.4 Boundary tests rigor (LOW-CAVEAT-EO-2-2)
For every threshold (warn vs critical, ratio vs absolute, days vs counts):
- Pin EXACTLY the threshold value (e.g., ratio==1.5 → within_normal; ratio==1.501 → drift)
- Test BOTH directions of the boundary
- Make strict-vs-inclusive semantics explicit in test names
This catches off-by-one defects that ratio-test detectors are otherwise vulnerable to.
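As an illustration of pinning a boundary from both sides, the tests below use the 1.5 warn threshold and the strict semantics given in the first bullet. The `classify` function is a stand-in invented for this sketch, not a function from any packet.

```python
WARN_THRESHOLD = 1.5


def classify(ratio: float) -> str:
    """Strict comparison: a ratio exactly at the threshold is still normal."""
    return "drift" if ratio > WARN_THRESHOLD else "within_normal"


def test_ratio_exactly_at_threshold_is_within_normal_strict():
    assert classify(1.5) == "within_normal"


def test_ratio_just_above_threshold_is_drift():
    assert classify(1.501) == "drift"


def test_ratio_just_below_threshold_is_within_normal():
    assert classify(1.499) == "within_normal"
```

If the comparison were later changed from `>` to `>=`, the first test would fail and its name would say which semantic was broken.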
§7.5 Co-tenant safe staging
When operator (or another agent) has unstaged work in the same repo:
- NEVER git add -A (absorbs co-tenant changes)
- Stage SPECIFIC paths: git add -- path1 path2 ...
- Verify staged set with git status --short before commit
- If a shared file (e.g., architecture/test_topology.yaml) needs your single-line addition AND has co-tenant edits:
  - Stash the file: git stash push -- <file>
  - Re-edit your single-line change in clean state
  - Stage YOUR file
  - git stash pop to restore co-tenant edits (still unstaged)
Verified successful in CALIBRATION_HARDENING BATCH 3 + LEARNING_LOOP BATCH 3.
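The "verify staged set" step can be made mechanical by comparing the output of git status --short against the list of paths you meant to stage. The sketch below parses that output as text. The helper names are hypothetical, and it assumes paths without special characters (git quotes unusual paths in short format, which this sketch does not handle).

```python
from typing import Iterable, List, Set


def staged_paths(status_short: str) -> List[str]:
    """Paths with staged changes, parsed from `git status --short` output."""
    staged = []
    for line in status_short.splitlines():
        if len(line) < 4:
            continue
        # Short format is "XY path": X is the index state, Y the working-tree state.
        index_state, path = line[0], line[3:]
        if index_state not in (" ", "?", "!"):
            # Renames are reported as "old -> new"; keep the new path.
            staged.append(path.split(" -> ")[-1])
    return staged


def unexpected_staged(status_short: str, intended: Iterable[str]) -> Set[str]:
    """Staged paths that were not in the intended list (should be empty)."""
    return set(staged_paths(status_short)) - set(intended)
```

A non-empty result from `unexpected_staged` means something outside the intended list is in the index and should be unstaged before committing.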
§7.6 Plumbing-merge for hook-bypass FF
When a clean FF merge is correct but pre-commit hooks fail due to co-tenant WIP test failures:
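# Compute the merged tree without touching the working tree or index (needs Git 2.38+)
TREE=$(git merge-tree --write-tree main feature)
# Record both parents so the result is a real merge commit that contains feature's history
COMMIT=$(git commit-tree "$TREE" -p main -p feature -m "merge(sync): ...")
# Point main at the new commit; no hooks run
git update-ref refs/heads/main "$COMMIT"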
This bypasses working tree + index + hooks. Only use when:
- The merge is provably FF (no conflicts)
- The hook failures are confirmed co-tenant WIP (not real regressions)
- Operator has explicitly authorized
§8 Cross-module composition seam
When a weekly runner needs to compose outputs from sibling packets (e.g., parameter_drift result feeds into learning_loop_stall detector):
Anti-pattern: detector module directly imports + reads sibling module's DB → cross-coupling, harder to test, breaks K1 purity
Pattern: caller-provided seam
# Bad: detector reads cross-module
def detect_X(conn, ...):
    drift = compute_other_thing(conn, ...)  # cross-module DB read
    ...

# Good: caller provides cross-module result
def detect_X(history, *, drift_detected=None, ...):  # pure Python
    if drift_detected is None:
        # honest tri-state: we can't tell yet
        ...
The orchestration happens in the runner (BATCH 3), which is allowed to call multiple sibling modules. Detector modules stay pure.
Tri-state honesty: when CALIBRATION's parameter snapshot history is insufficient, runner records drift_detected=None (not False). False would imply "we checked and found no drift"; None correctly says "we can't tell yet". Operator runbook documents the distinction.
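A runnable version of the caller-provided seam, with the tri-state preserved end to end, might look like this. The stall rule, the drift rule, the `min_snapshots=3` default, and the names `detect_stall` and `run_weekly` are placeholders invented for the sketch; only the None / False / True semantics come from the text.

```python
from typing import List, Optional


def detect_stall(history: List[float], *, drift_detected: Optional[bool] = None) -> dict:
    """Pure function: no DB handle; the caller supplies the sibling packet's result."""
    stalled = len(history) >= 2 and history[-1] == history[-2]  # placeholder stall rule
    if drift_detected is None:
        explanation = "drift status unknown; cannot attribute"
    elif drift_detected:
        explanation = "drift detected upstream"
    else:
        explanation = "drift checked and not found"
    return {"stalled": stalled, "drift_detected": drift_detected, "explanation": explanation}


def run_weekly(history: List[float], snapshots: List[float], *, min_snapshots: int = 3) -> dict:
    """Runner-level orchestration: the only place that touches both modules."""
    if len(snapshots) < min_snapshots:
        drift: Optional[bool] = None  # we can't tell yet
    else:
        drift = max(snapshots) - min(snapshots) > 1.0  # placeholder drift rule
    return detect_stall(history, drift_detected=drift)
```

Because `None` is passed through rather than coerced to `False`, the report can distinguish "not enough history to check" from "checked and clean".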
§9 Critic 10-ATTACK template
Standard critic dispatch includes 10-12 ATTACK probes. Categories that empirically catch defects:
| # | Probe category | Example |
|---|---|---|
| 1 | Independent test reproduction | "Re-confirm baseline X/Y/Z independently in a fresh shell" |
| 2 | Independent CLI/REPL probe | "cd /tmp && python3 /repo/scripts/ |
| 3 | Surface coupling check | "Verify NEW code does NOT cross-import K3 active surfaces" |
| 4 | Cite-content verification | "Read cited file:lines and confirm content matches claim" |
| 5 | Boundary semantic | "ratio==threshold → which side? Is strict-vs-inclusive pinned?" |
| 6 | K1 compliance | "Grep INSERT/UPDATE/DELETE in NEW lines only; expect zero matches" |
| 7 | Co-tenant safety | "Confirm exactly N files in commit; co-tenant unstaged left alone" |
| 8 | Honest tri-state | "When data is insufficient, does function return None vs False vs True?" |
| 9 | Sibling coherence | "Defaults match prior packets; dataclass shape mirrors siblings" |
| 10 | Operator runbook actionability | "AGENTS.md tells operator what to do per outcome?" |
Plus packet-specific probes for HONEST DISCLOSURE / KEY OPEN QUESTION verification.
§9.1 Anti-rubber-stamp tells
Critic verdicts that contain these phrases must be rejected (operator should escalate):
- "Pattern proven" (without citing the test that proves it)
- "Narrow scope self-validating" (translation: I didn't actually verify)
- "Trust the executor's test count" (without independent reproduction)
- "All tests pass" (without naming which N tests / which baseline)
§9.2 Rubber-stamp resistance via independent reproduction
Critic MUST run at least one independent verification per BATCH:
- Re-run pytest from a fresh shell
- Re-grep cited line ranges
- Issue REPL probe of the new function with synthetic input
- git show --stat to verify file count claim
These take 1-2 minutes; they catch the ~10% of cycles in which the executor's narrative doesn't match the actual diff.
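The file-count check from the list above can be done on the text of git show --stat. A minimal sketch follows, with hypothetical helper names; it reads the summary line that git prints at the end of the stat output.

```python
import re
from typing import Optional

# Summary line at the end of `git show --stat`, e.g. " 3 files changed, 40 insertions(+)"
STAT_SUMMARY_RE = re.compile(r"^\s*(\d+) files? changed", re.MULTILINE)


def files_changed(stat_output: str) -> Optional[int]:
    """Number of files in the commit, or None if no summary line is present."""
    matches = STAT_SUMMARY_RE.findall(stat_output)
    # Take the last match so a similar phrase in the commit message is ignored.
    return int(matches[-1]) if matches else None


def file_count_claim_holds(stat_output: str, claimed: int) -> bool:
    """Compare the executor's claimed file count with what the commit contains."""
    return files_changed(stat_output) == claimed
```

The input could come from `subprocess.run(["git", "show", "--stat", sha], capture_output=True, text=True, check=True).stdout`, run by the critic in its own shell rather than copied from the executor's message.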
§10 Closure phase — multi-angle review before main-merge
After all batches APPROVED on dev branch, before opening / merging PR to main:
§10.1 Multi-angle review (5 parallel sub-agents)
Per memory feedback_multi_angle_review_at_packet_close, dispatch 5 parallel sub-agents covering different angles:
- architect — high-level structural decisions; cross-module impact; long-term maintainability
- critic — adversarial probes; edge cases; failure modes
- explore — codebase coverage; unintended cross-module references; orphan code
- scientist — empirical validation; metric soundness; statistical correctness (if applicable)
- verifier — completion checks; evidence adequacy; test sufficiency
Each writes to evidence/closure/<role>_review_YYYY-MM-DD.md. Team-lead synthesizes into PR description.
§10.2 PR description template
## Summary
<commit count>-commit fast-forward / merge-commit. <conflicts status>. Test baseline X passed / Y skipped / 0 failed.
## What's shipping
### <Packet 1> (<N> commits)
- <commit SHA> <subject>
...
## Architecture
- ZERO touches to K0 frozen / K3 active / schema
- <count> small read-only additions to <surface> (each critic-gated)
- ZERO crossing with mainline-owned files (per <verdict> §1)
## Test baseline
<reproduction command + output>
## Critic review provenance
<N> critic cycles total: A clean APPROVE / B APPROVE-WITH-CAVEATS / C REVISE / D BLOCK
Anti-rubber-stamp 100% maintained.
Per-packet review evidence under <path>.
<Notable methodology dividends>
## Test plan
- [x] FF-merge confirmed
- [x] Pytest baseline reproduced
- [x] All N ephemeral worktree branches verified merged
- [x] Multi-angle closure review (5 sub-agents) completed
- [ ] Operator merges PR
## Follow-ups (tracked, not in this PR)
- <list>
§10.3 Operator-only merge gate
Team-lead does NOT merge. PR sits open until operator reviews multi-angle evidence + merges. This preserves the explicit operator-authorization gate for shared-state actions.
§11 Failure modes + recovery
F1: SendMessage drop (executor BATCH_X_DONE not arriving)
Symptom: executor sent message; team-lead never received it.
Recovery: poll disk. Agent evidence files (<packet>_boot.md, commits, etc.) are canonical. SendMessage is delivery-only; if a message is dropped, the work is still there. Read the commit log + evidence dir. The team-lead resumes by acknowledging the work as if the message had arrived.
F2: Crossed-in-flight (executor pre-shipped while dispatch was in transit)
Symptom: executor sends BATCH_X_DONE before team-lead's GO_BATCH_X arrives.
Recovery: verify pre-shipped commit functionally aligns with dispatch (same function signature / same defaults / same tests). If aligned, push without modification + acknowledge crossed-in-flight in next dispatch. If misaligned, dispatch BATCH_X_REVISE.
Empirical: happened once in case study (WS_POLL BATCH 2). Pre-shipped work was fully aligned; pushed without modification.
F3: Cycle drift (executor's narrative diverges from diff)
Symptom: executor claims "I did X" but git show reveals X wasn't done OR was done incorrectly.
Recovery: critic-gate catches this empirically (cycle 22 caught the WP-1-1 row multiplication where executor claimed "n_signals counts unique ticks" but SQL was SELECT count(*) over JOIN producing duplicates). Always run independent reproduction.
F4: Co-tenant absorption
Symptom: git diff HEAD~1 shows files that weren't in your stage list.
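Recovery (commit not yet pushed): git reset HEAD~1 (mixed reset) undoes the commit and unstages everything while leaving all changes, co-tenant work included, in the working tree; re-stage YOUR specific files; re-commit. Note that git reset --soft HEAD~1 alone would leave the absorbed files staged. Critic catches via "exactly N files in commit" probe.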
F5: Citation rot
Symptom: cite to <file>:<line> no longer matches the claimed content (file changed since cite was written).
Recovery: cycle-29 cite-CONTENT discipline. Grep-verify CONTENT (not just line number) within 10 min before commit. Citations rot ~20-30%/week per memory feedback_zeus_plan_citations_rot_fast.
F6: Wrong baseline cited
Symptom: hook fails with "PASSED < BASELINE_PASSED" because baseline count was set wrong.
Recovery: independent reproduction. Critic re-runs the exact 11-file baseline command; corrects to actual count; updates hook. Memory feedback_critic_reproduces_regression_baseline codifies this.
§12 Empirical track record (Zeus R3 §1 #2 — case study)
5 packets shipped over ~1 work-session day:
| Packet | Commits | Risk | Cycles |
|---|---|---|---|
| EDGE_OBSERVATION | 3 | LOW | 3 |
| ATTRIBUTION_DRIFT | 3 | LOW | 3 |
| WS_OR_POLL_TIGHTENING | 5 | MEDIUM (1 REVISE) | 4 |
| CALIBRATION_HARDENING | 3 | MEDIUM (K3-adjacent) | 3 |
| LEARNING_LOOP | 3 | HIGHEST (K3 retrain_trigger) | 3 |
Total: 17 commits / 32 critic cycles / 144 new tests / 0 BLOCK / 1 earned REVISE / 100% anti-rubber-stamp maintained
Methodology dividends observed:
- Cycle-29 cite-CONTENT discipline → cycle-30 caught substrate misread within 24h (Fitz Constraint #3 antibody pattern)
- PATH A pattern caught structural mismatch in 4 of 5 packet boots (avoided invented-data shipping)
- K3-adjacent pre-flag pattern landed 3 read-only surface additions clean (store.py + retrain_trigger.py)
- Carry-forward LOW pattern resolved 100% of LOWs by packet close
- ZERO crossing with R3 mainline (independently audited)
§13 Cross-references
Existing methodology
- docs/methodology/adversarial_debate_for_project_evaluation.md: DEBATE phase (R1/R2/R3)
- .Codex/skills/zeus-methodology-bootstrap/SKILL.md: auto-load on debate-class tasks
Memory feedback notes (memory: load if relevant)
- feedback_critic_prompt_adversarial_template: 10-ATTACK template; never write "narrow scope self-validating"
- feedback_executor_commit_boundary_gate: executor cannot self-approve over multi-batch work
- feedback_zeus_plan_citations_rot_fast: file:line citations rot ~20-30%/week
- feedback_converged_results_to_disk: SendMessage drop pattern; disk is canonical
- feedback_idle_only_bootstrap: spawn longlast teammates with idle-only boot prompt
- feedback_no_git_add_all_with_cotenant: git add SPECIFIC files; never -A
- feedback_multi_angle_review_at_packet_close: 5 parallel sub-agents before DEBATE_CLOSED
- feedback_critic_reproduces_regression_baseline: critic always re-runs regression
- feedback_grep_gate_before_contract_lock: grep-verify file:line before lock
- feedback_critic_via_team_not_agent: critic must be native team member, not Agent-spawned
- feedback_default_dispatch_reviewers_per_phase: auto-dispatch critic post-implementation
Related skills
- .Codex/skills/zeus-phase-discipline/SKILL.md: Mode C per-batch discipline
- .Codex/skills/zeus-task-boot-*/SKILL.md: task-specific boot profiles
§14 Maintenance
This skill is living. Update after each methodology cycle that:
- Surfaces a new outcome category (cycle-5 added "5th outcome: stage-gated revert")
- Surfaces a new discipline pattern (cycle-29 added "cite-CONTENT discipline")
- Surfaces a new failure mode (cycle-22 catches like WP-1-1 row multiplication)
- Adds a new sibling-coherence rule
When adding to this skill:
- Cite the specific cycle that surfaced the pattern
- Add to §7 (discipline) or §11 (failure modes) or §6 (risk framing) as appropriate
- Update the case-study metrics in §12
- Cross-link from feedback_* memory note if applicable
§15 Lineage
v1 (2026-05-03): created post R3 §1 #2 5-packet phase (32 critic cycles, 100% anti-rubber-stamp). Generalizes the EXECUTION + CLOSURE phases that were previously implicit in adversarial_debate_for_project_evaluation.md §5. Designed for reuse across any project; case-study Zeus-specific but patterns are generic.
Replaces the implicit "team-lead figures it out from feedback memory + prior session" pattern with an explicit reusable skill.