name: bm.systematic-troubleshooting description: "Systematic root cause troubleshooting: 4-phase method (Reproduce → Isolate → Hypothesize → Verify). Use when something is broken, wrong, slow, weird, unexpected, intermittent, 出错, 报错, bug, 问题, 不对, 慢, 卡, 怪, 排查, 定位, debug, troubleshoot. Applicable to code bugs, product anomalies, process errors, data issues — any scenario where 'things are not happening as expected'." source: opencrew version: "20260530.01"
Skill: Systematic Troubleshooting
First reproduce, then isolate, then hypothesize, finally verify. Never rely on guessing. Applicable to all scenarios where "things are not happening as expected".
Not just code bugs — product process anomalies, data mismatches, device malfunctions, operations not responding, processes stuck — all apply.
Core Principles
- Evidence before hypotheses: Don't start fixing without reproducing
- Narrow the scope, don't broaden it: Each step should make the problem boundary clearer
- Hypotheses must be falsifiable: Can be verified with a single yes/no action
- Change the minimum: Only manipulate one variable at a time
- Must verify after fixing: Otherwise you may have just masked the issue
4-Phase Method
Phase 1: Reproduce
Goal: Make the problem appear in a controlled, stable, minimal-step manner.
Key actions:
- Find the minimal reproduction steps (remove all non-essential steps)
- Record environmental differences: when it occurs vs. when it doesn't
- Cannot stably reproduce: Record frequency and conditions, don't skip
Problem classification (determines next-step strategy):
| Type | Characteristics | Difficulty |
|---|---|---|
| Always reproducible | Occurs every time | Low |
| Condition-triggered | Only occurs with specific input/state/time | Medium |
| Intermittent | Occurs sometimes, no pattern | High (first try to make it condition-triggered) |
| One-time | Happened once and never again | Highest (first collect evidence, wait for recurrence) |
Anti-patterns:
- ❌ Start modifying code without reproducing → You won't know if the fix worked
- ❌ Reproduce once and think it's handled → Intermittent issues need multiple confirmations
Phase 2: Isolate
Goal: Compress the boundary of "where the problem is" from large to small.
Methods:
| Method | Operation | Applicable When |
|---|---|---|
| Bisect | Cut in half and see which side has the problem | Don't know which segment it's in |
| Swap | Change data/environment/device/account | Suspect environment |
| Revert | Go back to a known-good version | It was working before |
| Minimize | Strip down to minimal reproduction | Complex systems |
| Timeline | git bisect / review history | Regression issues |
Ask "Why" 5 Times:
- Surface: Error thrown
- Why the error? → Function returned None
- Why None? → Database query returned nothing
- Why nothing? → Field name was misspelled
- Why misspelled? → Schema was changed last week without syncing
- Root cause: Change management process missing
Fix the deepest layer you can fix, not the surface layer.
Phase 3: Hypothesize
Goal: Based on evidence, list 2-3 most likely root causes, each verifiable.
Criteria for a good hypothesis:
- ✅ Specific: "
portfield inconfig.jsonis string'8080'instead of number8080" - ✅ Falsifiable: Running one action tells you if it's right or wrong
- ❌ "Might be a configuration issue" — too vague
- ❌ "Probably a caching problem" — no evidence
List hypotheses ordered by probability:
Hypothesis 1 (most likely): A broke B
- Verification: Run X and check Y
- Evidence source: Error message mentions A
Hypothesis 2: Environment difference
- Verification: Reproduce on staging
- Evidence source: Works locally, doesn't work in production
Hypothesis 3 (fallback): Third-party dependency
- Verification: Check dependency versions
- Evidence source: Recently upgraded
Phase 4: Verify
Goal: Confirm whether the hypothesis is correct, then confirm whether the fix is effective.
Two-step verification:
Verify root cause: Use Phase 3's verification action to confirm the hypothesis
- Hypothesis wrong → Return to Phase 3, try next hypothesis
- Hypothesis correct → Proceed to fix
Verify fix:
- Change the minimum code / configuration / steps
- Run Phase 1's reproduction steps again — must no longer trigger the problem
- Run related tests / adjacent scenarios — no new issues introduced
- Use bm.verification to provide a completion report
Output Format
Document the troubleshooting process for user follow-up / future retrospectives:
# Troubleshooting Record: {Problem}
## Symptoms
- {One-sentence description}
- Reproduction steps:
1. ...
2. ...
## Environment
- {OS / version / configuration}
## Troubleshooting Timeline
### Phase 1: Reproduce ✅
- Successfully reproduced, always reproducible
- Minimal steps: ...
### Phase 2: Isolate
- Bisected X, problem is in module Y
- 5 Whys: ... → Root cause guess: ...
### Phase 3: Hypothesize
1. ❌ A: Ruled out after verification (evidence: ...)
2. ✅ B: Confirmed (evidence: ...)
### Phase 4: Verify
- Fixed ...
- Reproduction steps no longer trigger the issue ✅
- Adjacent tests passed ✅
## Root Cause
{One sentence}
## Fix
{What was changed, why it was changed this way}
## Prevention
- {Process/checklist item}
Anti-Patterns
| Anti-Pattern | Why It's Wrong | Correct Approach |
|---|---|---|
| Blindly changing things and hoping | If it works you don't know why; if it fails you waste time | Reproduce first + hypothesize |
| Skip reproduction and guess directly | Hypotheses lack evidence, direction may be completely wrong | Stably reproduce first |
| Change 5 things at once | Don't know which one worked | One variable at a time |
| Fix it and move on once it works | Just masking the issue, may recur | Fix root cause, not symptoms |
| Look only at the outermost error | Outermost is the entry point, not the root cause | Read stack traces from inside out |
| Don't check existing issues | Might be a known problem | Search GitHub Issues / historical troubleshooting records first |
| Don't verify after fixing | Might not be fixed or may have introduced new issues | Run reproduction steps + adjacent tests |
Non-Code Scenario Examples
Scenario: User says "one of my files is missing"
- Reproduce: Ask the user which file specifically, when they noticed, when they last saw it
- Isolate: Was it actually deleted or just can't be found? Trash/recent files/git log?
- Hypothesize: Accidental deletion / renamed / sync issue / path changed
- Verify: Each hypothesis corresponds to a search action
Scenario: No one showed up to the meeting
- Reproduce: Who came to the previous few meetings?
- Isolate: Is this one unusual or has it always been like this? Who sent the invitations?
- Hypothesize: Invitations didn't reach people / time conflicts / topic wasn't compelling / no one felt they should attend
- Verify: Spot-check a few people and ask
File Locations
- Troubleshooting records →
./working/troubleshoot-{topic}.md - Root cause summaries for important issues → Archive to
./postmortems/{date}-{topic}.md(code project →./docs/postmortems/{date}-{topic}.md)
Code Project Detection: If code project markers exist under cwd (package.json, Cargo.toml, go.mod, pyproject.toml, setup.py, pom.xml, Gemfile, composer.json, or src/ + .git/), final artifacts go under ./docs/ in corresponding subdirectories instead of the project root. Intermediate artifacts in ./working/ remain unchanged. User-specified paths take priority.