name: systematic-debugging description: Root-cause debugging loop for bugs, failing tests, flakes, CI failures, performance regressions, and unexpected behavior. Use before proposing fixes when anything is broken, especially during worker implementation or CI-watch repair.
Systematic Debugging
No fixes without a red-capable feedback loop and a root-cause investigation. Skip phases only when explicitly justified.
When exploring the codebase, read CONTEXT.md and relevant architecture guides
so the reproduction and fix use the right domain/module language.
Phase 1: Build A Tight Feedback Loop
If you have a tight pass/fail signal for the bug, you can find the cause. If you do not, no amount of staring at code is enough.
Try feedback loops in roughly this order:
- failing test at the seam that reaches the bug
- curl or HTTP script against a running service
- CLI invocation with a fixture input
- Playwright or browser script asserting on DOM, console, or network behavior
- replayed trace: saved request, payload, event log, or fixture
- throwaway harness around the smallest runnable subset
- property or fuzz loop for "sometimes wrong" behavior
- bisection harness if the bug appeared between known states
- differential loop between old/new versions or configs
- human-in-the-loop script only when a human must click or inspect
Phase 1 is done only when you can name one command you have run at least once and it is:
- red-capable: it drives the user's exact symptom, not just nearby code
- deterministic: or, for flakes, has a high enough reproduction rate
- fast: seconds when practical
- agent-runnable: unattended unless the task truly requires HITL
Tighten the loop before moving on: make it faster, sharper, and more deterministic.
Phase 2: Reproduce And Minimize
Run the loop and watch it fail. Confirm the failure mode matches the user's symptom.
Then minimize the reproduction. Cut inputs, callers, config, data, and steps one at a time, re-running the loop after each cut. Stop when every remaining element is load-bearing: removing any one of them makes the loop pass.
Phase 3: Hypothesize
Generate 3-5 ranked hypotheses before testing any of them. Each hypothesis must be falsifiable:
If
is the cause, then will make the bug disappear or will make it worse.
Show the ranked list to the user when they are present, but proceed with the best-ranked hypothesis if they are AFK.
Phase 4: Instrument
Each probe must map to a prediction from Phase 3. Change one variable at a time.
Prefer:
- debugger or REPL inspection when available
- targeted logs at boundaries that distinguish hypotheses
- never "log everything and grep"
Tag temporary diagnostics with a unique prefix such as [DEBUG-a4f2] so cleanup
is a single search.
For performance regressions, establish a baseline measurement first. Use timing harnesses, profiling, query plans, or bisection before changing code.
Phase 5: Fix And Regression Test
Write or identify the regression test before the fix when a correct seam exists. A correct seam exercises the real bug pattern as it occurs at the call site. If the available seam is too shallow, note that architecture finding and fix the bug without pretending the shallow test proves it.
Then:
- turn the minimized repro into a failing test when practical
- watch it fail
- apply the root-cause fix
- watch the regression pass
- re-run the original Phase 1 loop
Keep the change inside the Linear issue scope when running as a worker or CI-watch repair.
Phase 6: Cleanup And Post-Mortem
Before declaring done:
- original repro no longer reproduces
- regression test passes, or absence of correct seam is documented
- all
[DEBUG-...]diagnostics are removed - throwaway harnesses/prototypes are deleted or clearly marked
- the correct hypothesis is stated in the PR, commit, or Linear evidence
Then ask what would have prevented the bug. If the answer is architectural, hand
off the specifics to /improve-codebase-architecture after the fix is in.
Ceird-Specific Checks
- Effect code: prefer typed errors,
Schema,Config, services, and layers; avoid casual thrown errors at boundaries. - Auth/org changes: verify fail-closed behavior and server-derived context.
- Persistence: inspect generated Drizzle migrations and query shape.
- Alchemy/provider behavior: do not run mutating provider commands without confirmed stage and credentials.
- UI flakes: prefer condition-based waits and user-visible assertions over arbitrary sleeps.