name: systematic-debugging
description: Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes
Skill: systematic-debugging
When
Any bug, test failure, or unexpected behavior — before proposing fixes.
Follows ARCS CLI Primer: arcs --commands --json for discovery, --json --lean on all calls.
Flow
flowchart TD
classDef decision fill:#f59e0b,color:#fff
classDef stop fill:#ef4444,color:#fff
Bug[Bug observed] --> ARCS[Check ARCS knowledge]
ARCS --> Found{Match found?}
Found -->|Yes| Verify[Verify it applies]
Found -->|No| Observe
Verify -->|Applies| Fix
Verify -->|Doesn't apply| Observe
Observe[Phase 1: Observe] --> Repro{Reproducible?}
Repro -->|No| Instrument[Add logging/tracing]
Instrument --> Observe
Repro -->|Yes| Hypothesize[Phase 2: Hypothesize]
Hypothesize --> Compare[Find working example, list differences]
Compare --> Theory[Form single specific hypothesis]
Theory --> Isolate[Phase 3: Isolate]
Isolate --> Test{Smallest change confirms?}
Test -->|Yes| Fix[Phase 4: Fix]
Test -->|No| FailCount{3+ failures?}
FailCount -->|No| Theory
FailCount -->|Yes| Arch[Question architecture]
Fix --> WriteFail[Write failing test]
WriteFail --> Implement[Single targeted fix]
Implement --> Green{Scoped tests pass?}
Green -->|Yes| Capture[Capture as ARCS knowledge]
Green -->|No| FailCount
class Found,Repro,Test,FailCount,Green decision
class Arch stop
Phase 1: Observe (Root Cause Investigation)
- Read the actual error message completely
- Reproduce consistently before proceeding
- Check recent changes (git log, git diff)
- Trace data flow backward from failure point
- Instrument component boundaries if cause unclear
- Pre-step: arcs knowledge search <slug> "<error>" --json for gotcha/lesson/pattern entries
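The "reproduce consistently" step can be made measurable. The following is a minimal sketch, assuming the failure shows up as a non-zero exit status of some command; the helper name repro_rate is hypothetical and not part of ARCS or any other tool.

```shell
# repro_rate <runs> <command> [args...]
# Run a command several times and report how often it fails.
# The function body is a subshell, so its variables stay local to it.
repro_rate() (
  runs=$1
  shift
  failures=0
  i=1
  while [ "$i" -le "$runs" ]; do
    # Output is discarded here; rerun the command directly to read the full error.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "${failures}/${runs} runs failed"
)
```

A result such as "10/10 runs failed" suggests the bug is reproducible enough to move to Phase 2. A mixed result points toward instrumenting first.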
Phase 2: Hypothesize (Pattern Analysis)
- Find a working example in the same codebase
- Compare working vs broken — list every difference
- Understand the dependency chain
- Form ONE specific hypothesis (not multiple)
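Listing every difference between a working and a broken setup is easier to do mechanically than by eye. The sketch below assumes both setups can be dumped to line-oriented text files (for example key=value configuration). The function name list_differences and the file names in the usage note are hypothetical.

```shell
# list_differences <working-file> <broken-file>
# Print lines that appear in only one of the two files.
# "<" marks lines unique to the working file, ">" lines unique to the broken one.
list_differences() (
  working=$1
  broken=$2
  tmp=$(mktemp -d) || exit 1
  # Sort first so that a pure reordering is not reported as a difference.
  LC_ALL=C sort "$working" >"$tmp/working"
  LC_ALL=C sort "$broken" >"$tmp/broken"
  # diff exits 1 when the files differ, which is the expected case here.
  diff "$tmp/working" "$tmp/broken" | grep '^[<>]' || true
  rm -rf "$tmp"
)
```

Calling list_differences working.env broken.env yields the full candidate list, from which one difference is chosen as the single hypothesis to test.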
Phase 3: Isolate
- Test with the smallest possible change
- One variable at a time — never stack fixes
- If hypothesis fails, form a new one from evidence
- Escalation: 3+ failed fixes → question the architecture, not the symptom
Phase 4: Fix
- Write a failing test FIRST (proves the bug exists)
- Implement a single targeted fix
- Verify the scoped tests for the files you changed pass (your dispatch VERIFY command — never the full suite; the devil-advocate completion gate owns that)
- If your fix introduces new failures in YOUR scoped tests, revert and return to Phase 2. Failures in files outside your scope are report-only (BLOCKED_BY) — likely a sibling agent's in-flight work; never fix or revert it
Log Triage Protocol
Scan order: failure point → errors → warnings → timing anomalies
rg -n "ERROR|FATAL|panic|exception" <logfile> # Error grep
jq 'select(.level == "error")' <json-log> # Structured logs
Output: Timeline of events leading to failure (T-5m, T-3m, T-0).
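One way to produce the T-minus timeline is to relabel each relevant log line with its offset from the failure. This is a sketch under stated assumptions: every line starts with an HH:MM:SS timestamp followed by a space, and all lines fall on the same day. The function name timeline is hypothetical, and real log formats will need a different timestamp parser.

```shell
# timeline <failure-time HH:MM:SS> <logfile>
# Prefix each log line with its offset in whole minutes from the failure time.
timeline() (
  failure=$1
  logfile=$2
  awk -v failure="$failure" '
    # Convert HH:MM:SS to seconds since midnight.
    function to_seconds(hms,    parts) {
      split(hms, parts, ":")
      return parts[1] * 3600 + parts[2] * 60 + parts[3]
    }
    BEGIN { t0 = to_seconds(failure) }
    {
      delta = to_seconds($1) - t0          # negative means before the failure
      minutes = int((delta < 0 ? -delta : delta) / 60)
      sub(/^[^ ]+ /, "")                   # drop the timestamp, keep the rest
      if (minutes == 0)      label = "T-0"
      else if (delta < 0)    label = "T-" minutes "m"
      else                   label = "T+" minutes "m"
      print label " " $0
    }
  ' "$logfile"
)
```

Filtering the log down to errors and warnings before passing it in keeps the timeline short enough to read in one pass.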
Git Bisect (Regressions)
git bisect start
git bisect bad HEAD
git bisect good <last-known-good>
git bisect run <test-command>
After finding the commit: read the diff, isolate specific lines, feed into Phase 2.
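The test command given to git bisect run is judged by its exit status: 0 marks the commit good, 125 tells bisect to skip it, and any other status from 1 to 127 marks it bad. The sketch below wraps a build step and a test step so that commits which cannot be built are skipped rather than blamed. The function name bisect_step and the script name in the usage note are hypothetical.

```shell
# bisect_step <build-command> <test-command>
# Return the status `git bisect run` should see for the current commit.
bisect_step() {
  if ! sh -c "$1" >/dev/null 2>&1; then
    return 125   # cannot build: ask bisect to skip this commit
  fi
  if sh -c "$2" >/dev/null 2>&1; then
    return 0     # test passes: commit is good
  fi
  return 1       # test fails: commit is bad
}
```

To use it, save the function in an executable script (say bisect-step.sh) that ends by calling bisect_step with the real build and test commands, then pass that script to git bisect run.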
Dependency Conflict Diagnosis
| Symptom | Likely Cause |
|---|---|
| instanceof fails across modules | Duplicate package copies |
| Type mismatch on same interface | Different versions loaded |
| "Cannot find module" intermittent | Hoisting conflict |
| Works with --legacy-peer-deps | Peer dep unsatisfied |
Diagnose: npm ls <pkg>, npm explain <pkg>, check for multiple copies.
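Checking for multiple copies can also be done directly on disk, which helps when npm ls output is long. This is a rough sketch, assuming a conventional node_modules layout and a package.json that contains a plain "version": "x.y.z" line. The function name find_copies is hypothetical.

```shell
# find_copies <project-dir> <package-name>
# List every installed copy of a package with its version and location.
find_copies() (
  root=$1
  pkg=$2
  find "$root" -path "*/node_modules/$pkg/package.json" | LC_ALL=C sort |
  while IFS= read -r manifest; do
    # Crude extraction: takes the first "version" field found in the manifest.
    version=$(sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$manifest" | head -n 1)
    echo "$version $(dirname "$manifest")"
  done
)
```

More than one output line means more than one copy is installed. Differing versions in the first column match the "Different versions loaded" row of the table, while identical versions at different paths match "Duplicate package copies".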
ARCS Knowledge Capture
After root cause identified, persist as knowledge:
- gotcha — environmental/config traps
- lesson — architectural insights from this session
- pattern — reusable solution to recurring problem
Include: root cause summary, evidence, affected files, fix approach.
Capture Resolution as Knowledge
After resolving the issue, persist the learning:
# For a surprising behavior or trap
arcs knowledge create <slug> "Redis connection pool exhaustion under load" \
--kind=gotcha \
--summary="Pool size defaults to 10; under concurrent requests >50, connections time out silently" \
--body="Root cause: default pool size. Fix: set poolSize to max(50, expectedConcurrency). Symptoms: intermittent 503s with no error logs." \
--json
# For a reusable debugging technique or resolution pattern
arcs knowledge create <slug> "Diagnosing silent connection failures" \
--kind=lesson \
--summary="Enable connection-level event logging before load testing" \
--body="Attach listeners to pool 'error' and 'timeout' events. Default Node.js behavior swallows these." \
--json
# For a pattern that should be followed going forward
arcs knowledge create <slug> "Connection pool sizing formula" \
--kind=pattern \
--summary="Pool size = max(50, 2x expected peak concurrency)" \
--body="Applies to Redis, Postgres, and HTTP agent pools. Validated under load test 2026-05-26." \
--json
Kind selection guide:
- gotcha: surprising behavior, trap, or non-obvious failure mode
- lesson: learned technique, debugging approach, resolution method
- pattern: reusable solution that should be applied going forward
Constraints
- NO FIXES WITHOUT ROOT CAUSE INVESTIGATION. If Phase 1 incomplete, you cannot propose fixes.
- One variable at a time. Never apply multiple changes simultaneously.
- 3+ failures = architectural problem. Stop fixing symptoms, question the pattern.
- Test before fix. Failing test proves the bug; green test proves the fix.
- Defense in depth: After fixing root cause, add validation at multiple layers to prevent recurrence.
- Systematic is faster than thrashing. Expect 15-30 min of systematic work versus 2-3 h of random fixes.
Red Flags (Return to Phase 1)
- "Quick fix for now, investigate later"
- "Just try changing X and see"
- Proposing solutions before tracing data flow
- Each fix reveals a new problem in a different place
- "I don't fully understand but this might work"
- Human says "stop guessing" or "is that not happening?"