name: rca description: Root cause analysis for process crashes, server deaths, and unexplained system failures using macOS logs, session forensics, and code tracing disable-model-invocation: true
Root cause analysis for a system-level failure. Gather ALL evidence before reasoning about the cause.
Incident description: $ARGUMENTS
Instructions
Follow these phases strictly in order. Do NOT speculate or reason about root cause until Phase 5. Every phase has a gate.
Phase 1: Scope the Incident
Parse the incident description — what died? (process, tmux server, container, service, etc.)
Establish the time window: ask the user or derive from context when the failure was noticed and when things last worked.
Identify what was running at the time — check:
tmux list-sessions/tmux list-panes -a(if tmux is back up)- Shell history with timestamps to reconstruct user activity:
tail -500 ~/.zsh_history | while IFS= read -r line; do if echo "$line" | grep -q "^: [0-9]"; then ts=$(echo "$line" | sed 's/^: \([0-9]*\):.*/\1/') cmd=$(echo "$line" | sed 's/^: [0-9]*:[0-9]*;//') dt=$(date -r "$ts" "+%Y-%m-%d %H:%M:%S" 2>/dev/null) echo "$dt $cmd" fi done - Filter to the time window around the incident.
Identify all active agents/processes — check Claude Code session logs:
ls -lt ~/.claude/projects/<project-dir>/*.jsonl | head -15Cross-reference modification times with the incident window.
Gate: Time window is established. List of active processes/sessions at the time is known.
Phase 2: System-Level Evidence (macOS Unified Log)
Run ALL of the following. Do not skip any — even if you expect empty results, the absence of evidence is evidence.
2a. Target process events
log show --start "<start>" --end "<end>" --style compact \
--predicate 'process == "<target-process>"'
Replace <target-process> with the crashed process (e.g., "tmux", "tmux-server", "agent-dashboard"). Run for each relevant process name.
2b. Code signature / AMFI failures
log show --start "<start>" --end "<end>" --style compact \
--predicate 'subsystem == "com.apple.AMFI" OR eventMessage CONTAINS "AMFI" OR eventMessage CONTAINS "code signature" OR eventMessage CONTAINS "CMS blob"'
2c. Process termination signals
log show --start "<start>" --end "<end>" --style compact \
--predicate 'eventMessage CONTAINS "SIGKILL" OR eventMessage CONTAINS "SIGTERM" OR eventMessage CONTAINS "SIGHUP" OR eventMessage CONTAINS "taskgated"'
2d. Memory pressure (jetsam)
log show --start "<start>" --end "<end>" --style compact \
--predicate 'category == "jetsam" OR eventMessage CONTAINS "memorystatus" OR eventMessage CONTAINS "jetsam"'
2e. Power management (sleep/wake)
log show --start "<start>" --end "<end>" --style compact \
--predicate 'subsystem == "com.apple.powerd" OR eventMessage CONTAINS "Wake reason" OR eventMessage CONTAINS "DarkWake" OR eventMessage CONTAINS "caffeinate"'
2f. Kernel events
log show --start "<start>" --end "<end>" --style compact \
--predicate 'sender == "kernel"' | head -50
2g. Crash reports
find ~/Library/Logs/DiagnosticReports /Library/Logs/DiagnosticReports \
-name "*<process-name>*" -newer <reference-file> 2>/dev/null
2h. Application-level crash logs
Check for any crash/diagnostic files the application writes (e.g., crash.log, debug-keys.log, state files).
Gate: All 8 log categories checked. Results recorded with exact timestamps. Note which categories returned empty.
Phase 3: Session Forensics
For each Claude Code session active during the incident window:
Extract all Bash tool calls — these are the commands Claude actually executed:
python3 << 'PYEOF' import json, os fpath = "<session-jsonl-path>" with open(fpath) as f: for line in f: try: obj = json.loads(line) if obj.get('type') == 'assistant': content = obj.get('message', {}).get('content', []) if isinstance(content, list): for block in content: if isinstance(block, dict) and block.get('type') == 'tool_use' and block.get('name') == 'Bash': cmd = block.get('input', {}).get('command', '') print(f'CMD: {cmd[:400]}') print() except: pass PYEOFFlag dangerous commands — search for:
kill,pkill,killall(process termination)tmux kill-*(tmux destruction)rm -rf,git clean,git reset --hard(destructive ops)signal,SIGKILL,SIGTERM(signal sending)- Any command referencing the crashed process
Extract subagent launches — check for Agent tool calls, especially background agents:
# Same pattern but filter for block.get('name') == 'Agent' # Check run_in_background, prompt contentIdentify the LAST command before the crash — cross-reference the session's final tool call timestamp with the system log timestamps from Phase 2.
Gate: Every active session's commands are extracted. The last command before the crash is identified with timestamp.
Phase 4: Code Path Analysis
Trace the code paths that were active at the time of the crash:
What was the last command doing? — if it was a build/test/run command, check:
- Does it spawn subprocesses? (
exec.Command,go test,npm test) - Do those subprocesses interact with the crashed system? (e.g., tmux commands, port binding)
- Are the spawned binaries code-signed? (critical on macOS)
- Does it spawn subprocesses? (
Check the crashed application's code for:
- All interactions with the system that died (e.g.,
grep -rn "tmux" internal/ cmd/) - Cleanup/teardown code that could cascade (e.g., session cleanup, pane cleanup)
- Signal handlers and their behavior
- Panic/crash recovery mechanisms
- Race conditions in concurrent code
- All interactions with the system that died (e.g.,
Instrument production runners to trace leaked subprocess calls — when tests crash an external system (tmux, databases, etc.), add debug logging to every production runner that spawns subprocesses. Log to a persistent file so the output survives the crash:
// Add to every production runner method that calls exec.Command func debugLogExec(caller, name string, args []string) { f, _ := os.OpenFile("/tmp/test-exec.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644) if f != nil { defer f.Close() fmt.Fprintf(f, "[%s] %s %v\n", caller, name, args) } }Run tests, then inspect the log to identify which packages and code paths are leaking real subprocess calls. An empty log file (or no file created) confirms zero leaks. This is especially useful when subprocess calls are indirect — e.g., an HTTP handler calls a function that calls a package function that eventually hits
exec.Commandthrough 3+ layers of indirection.Check configuration that affects crash behavior:
- tmux:
exit-empty,exit-unattached,destroy-unattached - Process supervisors: launchd plists, systemd units
- Application-level restart/respawn logic
- tmux:
Check recent code changes in the area:
git log --oneline --since="1 week ago" -- <relevant-paths>
Gate: Code paths from the last command to the crash point are traced. Configuration that affects cascading failures is documented.
Phase 5: Root Cause Analysis
Only now may you reason about the cause. Build the argument from evidence, not speculation.
Construct the event chain — a timestamped sequence from the trigger to the final failure. Every link must cite evidence from Phases 2-4:
HH:MM:SS.mmm— [source: unified log / session log / crash log] — event description
Identify the root cause — the earliest event in the chain that, if prevented, would have avoided the failure. Distinguish:
- Root cause: the underlying defect or condition
- Contributing factors: things that made it worse (e.g.,
exit-empty on) - Symptoms: what the user observed
Verify the chain is complete — are there gaps? If yes, state them explicitly as unknowns.
Rule out alternatives — for each plausible alternative cause, cite the evidence that eliminates it:
Alternative Ruled out by Example: manual kill-server No kill-server in any session log Example: sleep/wake SIGKILL No powerd sleep events in window
Phase 6: Report
Present a structured report:
Timeline
A table with timestamp, event, and evidence source for every link in the chain.
Root Cause
One paragraph explaining what happened and why, with evidence citations.
Contributing Factors
Bullet list of conditions that enabled or worsened the failure.
Evidence Summary
| Category | Result |
|---|---|
| AMFI/codesign | Found / Not found |
| Jetsam | Found / Not found |
| Sleep/wake | Found / Not found |
| Signals | Found / Not found |
| Session commands | Suspicious / Clean |
| Crash reports | Found / Not found |
Ruled-Out Alternatives
Table from Phase 5.
Recommended Fix
Concrete, minimal actions to prevent recurrence. Reference specific files and line numbers.
Gate: Report delivered. Every claim cites evidence. Unknowns are stated explicitly.
Transition to implementation
This skill is read-only. If the user asks to implement a fix based on your findings, do not start editing files. Instead, hand off to the appropriate skill:
- Bug fix or prevention → suggest
/agent-dashboard:fix <description> - Structural change to prevent recurrence → suggest
/agent-dashboard:refactor <description> - New safeguard or feature → suggest
/agent-dashboard:feature <description>
These skills handle branch/worktree setup, TDD, review, and delivery. Starting implementation inline from /agent-dashboard:rca skips those gates.