name: troubleshoot description: > Structured diagnostic workflow for investigating failures, unexpected behavior, or broken systems. Use when something is broken and you need to find out why — NOT when you need to implement a feature (use alfred-agent:develop for that). Separate from develop because troubleshooting has its own discipline: cleanup contract upfront, hypothesis-driven probes, ground-truth verification, narrow interventions, conclude-or-escalate, teardown. allowed-tools: Bash, Read, Write, Edit, Glob, Grep, Agent
alfred-agent:troubleshoot
Structured diagnostic workflow for investigating failures, unexpected behavior, or broken systems.
When to use vs. develop
- Use troubleshoot: something is broken, behaving unexpectedly, or failing — you need to find out why
- Use develop: you know what to build and need to implement it
- Never mix: don't start troubleshoot and slide into implementation; if you find the fix, stop, document it, and invoke develop
Step 0 — Cleanup contract (do this first)
Before touching anything, state explicitly:
- What you will probe — commands, logs, and files you'll read
- What you will NOT change during diagnosis
- Cleanup: what will be left behind (temp files, test processes, log noise) and how you'll remove them
This prevents the common failure of leaving the system worse than you found it.
Step 1 — State the symptom precisely
Write one sentence: "X is happening when Y is expected."
If you cannot write this sentence, you do not understand the problem yet — ask the user before proceeding. Do not begin probing until the symptom is stated clearly.
Step 2 — Gather ground truth
Before forming hypotheses: look at actual state. Do not theorize from memory.
- Read relevant logs (last N lines, not the whole file)
- Check process state:
ps aux | grep {process} - Check port bindings, file existence, config values — whatever is relevant to the symptom
- State what you found: "actual state is X"
Gather facts first. Hypotheses come after.
Step 3 — Form hypotheses
List 2–3 candidate root causes, ordered by likelihood. For each:
- Hypothesis: "The problem is caused by..."
- Test: "I can verify this by..."
- Falsifier: "This hypothesis is wrong if..."
Do not probe yet. Write all hypotheses before running any tests.
Step 4 — Probe narrowly
Test hypotheses one at a time, starting with the most likely. For each probe:
- State what you're testing
- Run the minimal command that tests it
- Report what you found
- Update the hypothesis list (confirm, eliminate, or revise)
Stop when you have a confirmed root cause OR have eliminated all hypotheses (→ Step 5, escalate path).
If a probe changes system state, note it explicitly.
Step 5 — Conclude or escalate
Escalation thresholds (check before each probe)
Stop immediately and escalate if either threshold is breached:
- Side-effects ≥ 5 — you have created 5 or more side-effects during this session
- 30+ minutes elapsed without progress — 30 or more minutes have passed since the last confirmed or eliminated hypothesis
Definition of side-effect — any of the following counts as one side-effect:
- Created a K8s resource (Pod, Deployment, Job, Namespace, Secret, etc.)
- Scaffolded a repo or directory structure outside the working directory
- Made a persistent host config change (modified
/etc/, systemd unit, cron entry) - Created a
/tmpartifact that will outlive the current troubleshoot loop - Sent an agent-message to another agent
Definition of progress — a hypothesis confirmed OR eliminated by evidence:
- "Checked logs, error X is NOT present → hypothesis A eliminated" = progress
- "Reproduced the failure with input Y → hypothesis B confirmed" = progress
- Spinning in place, retrying the same command, or generating new hypotheses without confirming/eliminating old ones does NOT count as progress
On threshold breach:
- Stop immediately — do not take further actions
- Summarise current state: hypotheses tested, which were confirmed, which were eliminated
- List all side-effects created so far (for cleanup)
- Escalate to user before continuing:
"Escalation threshold reached. Summary: [state]. Side-effects: [list]. Please review before I continue."
If root cause confirmed
- State it clearly: "Root cause: ..."
- Recommend fix: "Fix: ..." (do not implement — hand off to
alfred-agent:develop) - Note any side effects or risks
If root cause not confirmed after exhausting hypotheses
- Report what was ruled out
- State what information is still missing
- Escalate to user: "I've eliminated X, Y, Z. The following would help narrow it further: ..."
- Do NOT guess and implement
Step 6 — Teardown
Execute the cleanup contract from Step 0:
- Remove temp files created during diagnosis
- Stop any test processes started during diagnosis
- Report what was cleaned up
Leave the system in the same state it was in before diagnosis began (or better, never worse).
Rules
- Never implement a fix during troubleshoot — that is develop's job
- Cleanup contract before first probe, no exceptions
- Ground truth before hypotheses — read actual state, do not recall it from memory
- Narrow probes only — test one hypothesis at a time
- Conclude-or-escalate — do not leave the user with an open-ended "maybe it's X"
- Note state changes — if a probe changes system state, say so explicitly
- Escalate at threshold — side-effects ≥ 5 or 30+ min without progress: stop, summarise, list side-effects, escalate before continuing