debug - SKILL.md Agent Skill

name: debug description: Root cause debugging before fixes. Use when investigating bugs, diagnosing test failures, troubleshooting unexpected behavior, or when previous fix attempts failed. Enforces investigate-first discipline. allowed-tools: '*'

Systematic Debugger

Find root cause before fixing. Symptom fixes are failure.

Iron Law: NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

When to Use

Answer IN ORDER. Stop at first match:

Bug, error, or test failure? → Use this skill
Unexpected behavior? → Use this skill
Previous fix didn't work? → Use this skill (especially important)
Performance problem? → Use this skill
None of above? → Skip this skill

Use especially when:

Under time pressure (emergencies make guessing tempting)
"Quick fix" seems obvious (red flag)
Already tried 1+ fixes that didn't work

The Four Phases

Complete each phase before proceeding.

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:

1. Read Error Messages Completely

Don't skip past errors. They often contain the exact solution.
- Full stack trace (note line numbers, file paths)
- Error codes and messages
- Warnings that preceded the error

When the error mentions a library/framework symbol (a function, class, or module from a dependency) — read the installed version's docs before forming a hypothesis. Signatures, defaults, and deprecation behavior change between versions; training memory may describe a different version than the one running. Check package.json / lockfile, then the source wired up (Context7, official docs, README at the pinned ref).

2. Reproduce Consistently

Can reproduce?	Action
Yes, every time	Proceed to step 3
Sometimes	Gather more data - when does it happen vs not?
Never	Cannot debug what you cannot reproduce - gather logs

When repro hinges on user-only context (env, timeline, "what were you doing"), call /elicit — its multiple-choice format constrains the answer space better than open prompts.

3. Check Recent Changes

git diff HEAD~5       # Recent code changes
git log --oneline -10 # Recent commits

What changed that could cause this? Dependencies? Config? Environment?

4. Trace Data Flow (Root Cause Tracing)

When error is deep in call stack:

Symptom: Error at line 50 in utils.js
    ↑ Called by handler.js:120
    ↑ Called by router.js:45
    ↑ Called by app.js:10 ← ROOT CAUSE: bad input here

Technique:

Find where error occurs (symptom)
Ask: "What called this with bad data?"
Trace up until you find the SOURCE
Fix at source, not at symptom

5. Multi-Component Systems

When system has multiple layers (API → service → database):

# Log at EACH boundary before proposing fixes
echo "=== Layer 1 (API): request=$REQUEST ==="
echo "=== Layer 2 (Service): input=$INPUT ==="
echo "=== Layer 3 (DB): query=$QUERY ==="

Run once to find WHERE it breaks. Then investigate that layer.

Phase 2: Pattern Analysis

1. Find Working Examples

Locate similar working code in same codebase. What works that's similar?

2. Identify Differences

Working code	Broken code	Could this matter?
Uses async/await	Uses callbacks	Yes - timing
Validates input	No validation	Yes - bad data

List ALL differences. Don't assume "that can't matter."

Phase 3: Hypothesis Testing

1. Form Single Hypothesis

Write it down: "I think X is the root cause because Y"

Be specific:

❌ "Something's wrong with the database"
✅ "Connection pool exhausted because connections aren't released in error path"

2. Test Minimally

Rule	Why
ONE change at a time	Isolate what works
Smallest possible change	Avoid side effects
Don't bundle fixes	Can't tell what helped

3. Evaluate Result

Result	Action
Fixed	Phase 4 (verify)
Not fixed	NEW hypothesis (return to 3.1)
Partially fixed	Found one issue, continue investigating

Root Cause Checkpoint (REQUIRED)

Before proceeding to Phase 4:

Document root cause in ticket (under "## Root Cause" section):
- What is the actual cause (not symptom)?
- Why did this happen?
- How was it confirmed?
If using tickets: Update frontmatter to subtype: bug-investigated

Example:

## Root Cause

Connection pool exhausted because connections aren't released in error path.
Confirmed by adding pool logging - saw connections increment without decrement on errors.

Phase 4: Implementation

1. Create Failing Test

Before fixing, write test that fails due to the bug:

it('handles empty input without crashing', () => {
  // This test should FAIL before fix, PASS after
  expect(() => processData('')).not.toThrow();
});

2. Implement Fix

Address ROOT CAUSE identified in Phase 1
ONE change
No "while I'm here" improvements

3. Verify

New test passes
Existing tests still pass
Issue actually resolved (not just test passing)

End the fix report with Next: on its own line — imperative, name what's now (commit message to use, ticket to mark done, follow-up issue to open for related debt the investigation surfaced). A debug session that ends without naming the next move is incomplete.

Examples:

Next: commit with fix(auth): release pool connection in error path, mark ticket 234 done.
Next: commit the fix, then open a ticket for the unrelated logging gap surfaced at src/auth.ts:88.

4. If Fix Doesn't Work

Fix attempts	Action
1-2	Return to Phase 1 with new information
3+	STOP - Question architecture (see below)

5. After 3+ Failed Fixes: Question Architecture

Pattern indicating architectural problem:

Each fix reveals new coupling/shared state
Fixes require "massive refactoring"
Each fix creates new symptoms elsewhere

STOP and ask:

Is this pattern fundamentally sound?
Should we refactor vs. continue patching?
Discuss with user before more fix attempts
If user confirms the pattern is fundamentally unsound, call /figure-it-out to evaluate approaches (refactor / redesign / extract) before implementing anything.

Red Flags - STOP Immediately

If you catch yourself thinking:

Thought	Reality
"Quick fix for now, investigate later"	Investigate NOW or you never will
"Just try changing X"	That's guessing, not debugging
"I'll add multiple fixes and test"	Can't isolate what worked
"I don't fully understand but this might work"	You need to understand first
"One more fix attempt" (after 2+ failures)	3+ failures = wrong approach

ALL mean: STOP. Return to Phase 1.

Quick Reference

Phase	Key Question	Success Criteria
1. Root Cause	"WHY is this happening?"	Understand cause, not just symptom
2. Pattern	"What's different from working code?"	Identified key differences
3. Hypothesis	"Is my theory correct?"	Confirmed or formed new theory
4. Implementation	"Does the fix work?"	Test passes, issue resolved

Voice: plainspoken and concise — write to be scanned.