z-debug - SKILL.md Agent Skill

name: z:debug
description: Systematic root cause debugging - six-phase investigation with bounded retries, scope lock, and escalation rules

z:debug — Root Cause Debugging

Iron Law

No fixes without root cause investigation first.

Fixing symptoms creates whack-a-mole debugging. Every fix that skips root cause analysis makes the next bug harder to find.

Runs on all tiers (Quick, Standard, Full).

Phase 1: Root Cause Investigation

Gather context before forming any hypothesis.

Read error messages carefully and stack traces completely. Note line numbers, file paths, and error codes.
Read the code. Trace the code path from symptom to potential causes. Use Grep for references, Read for logic.
Check recent changes:
```
git log --oneline -20 -- <affected-files>
```
If the bug is a regression, the root cause is most likely in the diff since the last known-good commit.
Reproduce. Can you trigger it deterministically? If not, gather more evidence. Do not guess.
Trace data flow. Where does the bad value originate? What called this with bad input? Keep tracing upstream until you find the source.
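
When the bug is a regression and a known-good commit exists, the search through the history can be mechanized. The following is a minimal sketch of the binary search that `git bisect` performs. It assumes the commits are ordered oldest to newest, the first is good, the last is bad, and every commit after the first bad one is also bad. `is_bad` is a hypothetical callback that would check out a commit and run the reproduction.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is True.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid  # the first bad commit is after mid
    return commits[hi]
```

Each step halves the range, so 20 commits need at most 5 reproductions. This only works when the reproduction is deterministic, which is why the Reproduce step comes first.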

For multi-component systems (API, service, database, etc.):

Add diagnostic instrumentation at each boundary BEFORE proposing fixes
Log what enters and exits each component
Run once to gather evidence showing WHERE it breaks
THEN investigate the specific failing component
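
The boundary instrumentation described above could be sketched as a decorator that logs what enters and exits each component. This is an illustration under assumptions, not a prescribed implementation: `trace_boundary`, `handle_request`, and `load_user` are hypothetical names, and the two functions stand in for real API and database layers.

```python
import functools
import logging

log = logging.getLogger("boundary")


def trace_boundary(component: str):
    """Log what enters and exits a component boundary."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.info("%s IN  args=%r kwargs=%r", component, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                log.exception("%s RAISED", component)
                raise
            log.info("%s OUT result=%r", component, result)
            return result
        return wrapper
    return decorator


# Hypothetical components, for illustration only.
@trace_boundary("api")
def handle_request(user_id):
    return load_user(user_id)


@trace_boundary("database")
def load_user(user_id):
    return {"id": user_id, "name": None}  # the bad value first appears here
```

After calling `logging.basicConfig(level=logging.INFO)`, a single run prints the IN/OUT pair for every boundary. The first OUT line carrying the bad value identifies the component to investigate.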

Output: "Root cause hypothesis: ..." — a specific, testable claim.

Phase 2: Pattern Analysis

Check if the bug matches a known pattern:

| Pattern | Signature | Where to look |
| --- | --- | --- |
| Race condition | Intermittent, timing-dependent | Concurrent access to shared state |
| Nil/null propagation | TypeError, undefined is not | Missing guards on optional values |
| State corruption | Inconsistent data, partial updates | Transactions, callbacks, hooks |
| Integration failure | Timeout, unexpected response | External API calls, service boundaries |
| Configuration drift | Works locally, fails in prod | Env vars, feature flags, DB state |
| Stale cache | Shows old data, fixes on clear | Redis, CDN, browser cache |

Also check:

git log for prior fixes in the same area. Recurring bugs in one area = architectural smell.
Web search for {framework} {error type}. Sanitize first: strip hostnames, IPs, paths, customer data.
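
The sanitization step can be partly automated. The sketch below replaces URLs, IPv4 addresses, file paths, and hostnames with placeholders. The regular expressions and the list of hostname suffixes are assumptions chosen for illustration. They will miss cases, and customer data still needs a manual check.

```python
import re

# Order matters: URLs go first so the later patterns do not split them.
_PATTERNS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    # Two or more path segments, with an optional Windows drive letter.
    (re.compile(r"(?:[A-Za-z]:)?(?:[\\/][\w.\-]+){2,}"), "<path>"),
    # Assumed suffix list; extend it for the domains you actually use.
    (re.compile(r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|io|internal|local)\b", re.I), "<host>"),
]


def sanitize_query(text: str) -> str:
    """Replace identifying details in an error message with placeholders."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

For example, `sanitize_query("OperationalError at db01.acme.internal 10.0.3.7 in /srv/app/models.py")` returns `"OperationalError at <host> <ip> in <path>"`.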

Phase 3: Hypothesis Testing

Before writing ANY fix:

Form a single hypothesis: "I think X is the root cause because Y"
Test minimally: smallest possible change to verify. One variable at a time.
Verify: did it work? Yes = proceed to Phase 4. No = new hypothesis.

3-Strike Rule

If 3 hypotheses fail, STOP. This is likely architectural, not a simple bug.

Present the escalation:

3 hypotheses tested, none match. This may be architectural.
A) Continue — I have a new hypothesis: [describe]
B) Escalate — needs someone who knows the system
C) Add logging — instrument the area, catch it next time
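
The hypothesis loop and its strike limit can be expressed as a small driver. This is a sketch under assumptions: `run_hypotheses` is a hypothetical helper, and each hypothesis is modeled as a claim paired with a check that returns True when the minimal test confirms it.

```python
from typing import Callable, Iterable, Optional, Tuple

MAX_STRIKES = 3  # from the 3-Strike Rule


def run_hypotheses(
    hypotheses: Iterable[Tuple[str, Callable[[], bool]]],
) -> Tuple[Optional[str], int]:
    """Test hypotheses one at a time and stop after MAX_STRIKES failures.

    Returns (confirmed_claim, strikes). confirmed_claim is None when the
    strike limit is reached or the list runs out; both cases mean escalate.
    """
    strikes = 0
    for claim, check in hypotheses:
        if check():
            return claim, strikes
        strikes += 1
        if strikes >= MAX_STRIKES:
            break  # do not test a fourth hypothesis without escalating
    return None, strikes
```

A return value of `(None, 3)` is the point at which the escalation message above is presented, even if further hypotheses are queued.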

Phase 4: Scope Lock

After forming a root cause hypothesis, restrict edits to the affected module.

Identify the narrowest directory containing the affected files. Report to the user:

Edits restricted to <dir>/ for this debug session. Prevents accidental changes to unrelated code.

Do not touch files outside the locked scope without explicit user approval.
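
A scope check of this kind could look like the sketch below. `in_scope` is a hypothetical helper, not part of any tool. Paths are resolved first so that `..` segments cannot escape the locked directory.

```python
from pathlib import Path


def in_scope(path: str, locked_dir: str) -> bool:
    """Return True if path resolves to a location inside locked_dir."""
    target = Path(path).resolve()
    root = Path(locked_dir).resolve()
    # Compare whole path components, so "src/billing_v2" is not
    # mistaken for a child of "src/billing".
    return target == root or root in target.parents
```

With the lock set to `src/billing`, an edit to `src/billing/invoice.py` passes, while `src/auth/login.py` and `src/billing/../auth/login.py` both fail and would need explicit user approval.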

Phase 5: Implementation

Fix the root cause, not the symptom. Smallest change that eliminates the actual problem.
Minimal diff. Fewest files, fewest lines. Do not refactor adjacent code.
Write a regression test that:
- FAILS without the fix
- PASSES with the fix
Run the full test suite. No regressions.
Blast radius check: if the fix touches >5 files, flag it.
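
The regression test requirement can be illustrated with a deliberately small example. Everything here is hypothetical: `display_name_buggy` and `display_name_fixed` stand in for the code before and after a fix, and `regression_test` is written so it can be run against either version.

```python
from typing import Callable, Optional


def display_name_buggy(user: dict) -> str:
    # Pre-fix behavior: raises AttributeError when "name" is None.
    return user["name"].strip().title()


def display_name_fixed(user: dict) -> str:
    # Post-fix behavior: a missing name is handled explicitly.
    name: Optional[str] = user.get("name")
    return name.strip().title() if name else "Unknown"


def regression_test(display_name: Callable[[dict], str]) -> bool:
    """Return True if the implementation handles a user without a name."""
    try:
        return display_name({"id": 7, "name": None}) == "Unknown"
    except AttributeError:
        return False
```

Running the test against the pre-fix code must fail and against the post-fix code must pass. A test that passes in both cases does not exercise the bug and proves nothing about the fix.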

Bounded Retry

After implementing a fix, run tests.
If tests still fail: analyze the failure, fix (attempt 1).
If still failing: one more attempt (attempt 2).
If still failing: STOP and escalate. Do NOT attempt fix #3 without discussing architecture.
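
The retry bound can be made explicit in code. In this sketch, `run_tests` and `revise_fix` are hypothetical callbacks. Mapping the escalation outcome to the `BLOCKED` status is also an assumption.

```python
from typing import Callable

MAX_RETRIES = 2  # attempts 1 and 2 after the initial fix


def fix_with_bounded_retry(
    run_tests: Callable[[], bool],
    revise_fix: Callable[[int], None],
) -> str:
    """Run tests after the initial fix and allow at most MAX_RETRIES revisions."""
    if run_tests():
        return "DONE"
    for attempt in range(1, MAX_RETRIES + 1):
        revise_fix(attempt)  # analyze the failure, then adjust the fix
        if run_tests():
            return "DONE"
    return "BLOCKED"  # stop and escalate; no third attempt
```

The loop has no path to a third revision. Continuing past that point requires a decision made outside the loop, which is where the architecture discussion happens.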

Phase 6: Verification & Report

Fresh verification: reproduce the original bug and confirm it is fixed. This step is NOT optional.

Debug Report

DEBUG REPORT
══════════════════════════════════════
Symptom:         [what the user observed]
Root cause:      [what was actually wrong]
Fix:             [what was changed, with file:line references]
Evidence:        [test output showing fix works]
Regression test: [file:line of the new test]
Status:          DONE | DONE_WITH_CONCERNS | BLOCKED
══════════════════════════════════════
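
If reports are generated programmatically, a small data class can keep the fields and status values consistent. This is a sketch: `DebugReport` is a hypothetical type, and the rule width and column padding are approximations of the template above.

```python
from dataclasses import dataclass

STATUSES = ("DONE", "DONE_WITH_CONCERNS", "BLOCKED")
RULE = "═" * 38  # assumed width
LABEL_WIDTH = 17


@dataclass
class DebugReport:
    symptom: str
    root_cause: str
    fix: str
    evidence: str
    regression_test: str
    status: str

    def render(self) -> str:
        """Render the report, rejecting any status outside STATUSES."""
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        fields = [
            ("Symptom", self.symptom),
            ("Root cause", self.root_cause),
            ("Fix", self.fix),
            ("Evidence", self.evidence),
            ("Regression test", self.regression_test),
            ("Status", self.status),
        ]
        lines = ["DEBUG REPORT", RULE]
        for label, value in fields:
            lines.append((label + ":").ljust(LABEL_WIDTH) + value)
        lines.append(RULE)
        return "\n".join(lines)
```

Validating the status at render time prevents ad hoc values such as "mostly done" from appearing in a report.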

Red Flags — STOP and Return to Phase 1

"Quick fix for now" — there is no "for now"
Proposing a fix before tracing data flow — that is guessing
Each fix reveals a new problem — wrong layer, not wrong code
"Should work now" without running verification
"One more fix attempt" after 2+ failures

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| "Too simple to investigate" | Simple bugs have root causes too |
| "Emergency, no time" | Systematic is faster than thrashing |
| "Just try this first" | First fix sets the pattern. Do it right. |
| "I see the problem" | Seeing symptoms is not understanding root cause |
| "One more fix" (after 2+) | 3+ failures = architectural problem |