name: investigate
description: Systematic incident investigation methodology. Use when investigating production issues, service degradation, errors, latency spikes, or outages. Provides 5-phase framework for evidence-based root cause analysis.
5-Phase Investigation Methodology
You are an expert SRE investigator. Follow this systematic approach for all incident investigations.
Phase 1: Scope the Problem
Before using any tools, understand:
- Symptom: What is the reported issue? (errors, latency, downtime)
- Timeline: When did it start? Is it ongoing or resolved?
- Impact: Users affected, SLO breach, revenue impact?
- Changes: Recent deployments, config changes, traffic patterns?
- Services: Which systems are likely involved?
Phase 2: Gather Evidence (Statistics First)
CRITICAL: Get statistics before diving into raw data.
Observability (logs, metrics, traces)
For log/metric analysis, use the appropriate subagent:
- Spawn `log-analyst` for deep log analysis
- The subagent reads observability skills for query syntax
Key principle: Aggregations before samples
- Get counts and distributions first
- Identify error patterns and temporal clusters
- THEN sample specific entries
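The "aggregations before samples" order can be sketched as follows. This is a minimal illustration, not a real query against any log backend: the `LOGS` records, their field names, and the `summarize` and `sample` helpers are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records; in practice these come from your log backend.
LOGS = [
    {"ts": "2024-05-01T10:00:12", "level": "ERROR", "msg": "upstream timeout"},
    {"ts": "2024-05-01T10:00:47", "level": "ERROR", "msg": "upstream timeout"},
    {"ts": "2024-05-01T10:01:05", "level": "INFO", "msg": "request ok"},
    {"ts": "2024-05-01T10:01:30", "level": "ERROR", "msg": "connection reset"},
    {"ts": "2024-05-01T10:05:02", "level": "ERROR", "msg": "upstream timeout"},
]


def summarize(logs):
    """Step 1: counts per error message and per minute, no raw entries yet."""
    errors = [r for r in logs if r["level"] == "ERROR"]
    by_message = Counter(r["msg"] for r in errors)
    by_minute = Counter(
        datetime.fromisoformat(r["ts"]).strftime("%H:%M") for r in errors
    )
    return by_message, by_minute


def sample(logs, msg, limit=2):
    """Step 2: only after aggregating, pull a few raw entries for one pattern."""
    return [r for r in logs if r["msg"] == msg][:limit]


by_message, by_minute = summarize(LOGS)
top_pattern, top_count = by_message.most_common(1)[0]
examples = sample(LOGS, top_pattern)
```

The counts show which pattern dominates and when it clusters, so the sampling step reads a handful of entries instead of the whole stream.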
Infrastructure (Kubernetes, AWS)
For K8s/infrastructure issues:
- Spawn `k8s-debugger` subagent
- Events BEFORE logs - events explain most issues faster
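As a rough sketch of "events before logs", the snippet below summarizes Warning events before any pod log is opened. It assumes input shaped like the JSON from `kubectl get events -o json`; the `EVENTS` sample, the pod name, and the `warning_summary` helper are hypothetical, and the sketch assumes `lastTimestamp` is populated on every event.

```python
from collections import Counter

# Hypothetical excerpt shaped like `kubectl get events -o json` output.
EVENTS = {
    "items": [
        {"type": "Normal", "reason": "Scheduled",
         "message": "Successfully assigned pod to node",
         "lastTimestamp": "2024-05-01T10:00:00Z",
         "involvedObject": {"kind": "Pod", "name": "api-example"}},
        {"type": "Warning", "reason": "Unhealthy",
         "message": "Readiness probe failed",
         "lastTimestamp": "2024-05-01T10:02:10Z",
         "involvedObject": {"kind": "Pod", "name": "api-example"}},
        {"type": "Warning", "reason": "BackOff",
         "message": "Back-off restarting failed container",
         "lastTimestamp": "2024-05-01T10:03:00Z",
         "involvedObject": {"kind": "Pod", "name": "api-example"}},
        {"type": "Warning", "reason": "BackOff",
         "message": "Back-off restarting failed container",
         "lastTimestamp": "2024-05-01T10:04:30Z",
         "involvedObject": {"kind": "Pod", "name": "api-example"}},
    ]
}


def warning_summary(events):
    """Return Warning events in time order plus a count per reason."""
    warnings = [e for e in events["items"] if e["type"] == "Warning"]
    # Timestamps share one ISO 8601 format, so string order is time order.
    warnings.sort(key=lambda e: e["lastTimestamp"])
    by_reason = Counter(e["reason"] for e in warnings)
    return warnings, by_reason


warnings, by_reason = warning_summary(EVENTS)
first_warning = warnings[0]["reason"]
```

Here the first warning (a failing readiness probe) precedes the restart back-off, which already narrows where to look in the logs.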
Phase 3: Form Hypotheses
Based on evidence, rank hypotheses:
- H1: Most likely cause based on data
- H2: Second most likely
- H3: Alternative explanation
For each hypothesis, identify:
- What evidence supports it?
- What evidence would refute it?
Phase 4: Test Hypotheses
For each hypothesis:
- What specific evidence would confirm it?
- What specific evidence would refute it?
- Gather that evidence
- Update rankings based on findings
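One way to keep the ranking from Phases 3 and 4 explicit is to record evidence per hypothesis and re-sort after each test. The sketch below is illustrative only: the `Hypothesis` class, the example statements, and the scoring rule (one refuting observation outweighs one supporting observation) are assumptions, not part of the methodology itself.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    label: str
    statement: str
    supporting: list = field(default_factory=list)
    refuting: list = field(default_factory=list)

    def score(self):
        # Assumed rule: refuting evidence counts double.
        return len(self.supporting) - 2 * len(self.refuting)


def rank(hypotheses):
    """Order hypotheses from most to least likely under the assumed score."""
    return sorted(hypotheses, key=lambda h: h.score(), reverse=True)


h1 = Hypothesis("H1", "Latest deploy introduced a regression")
h2 = Hypothesis("H2", "Database connection pool is saturated")
h3 = Hypothesis("H3", "Traffic spike exceeded capacity")

# Evidence gathered while testing (hypothetical observations).
h1.supporting.append("error rate rises 2 minutes after the deploy")
h1.supporting.append("errors appear only on the new version")
h2.supporting.append("slow queries in the same window")
h2.refuting.append("pool utilization stayed below its limit")

ranking = [h.label for h in rank([h1, h2, h3])]
```

Keeping supporting and refuting evidence in separate lists also makes it easy to report which observations drove the final ranking.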
Phase 5: Conclude and Remediate
Structure your conclusion:
**Root Cause**: [Specific, actionable cause]
**Evidence**:
- [Metric/log/event that supports]
- [Correlation or change point identified]
- [Timeline of events]
**Confidence**: [High/Medium/Low - explain why]
**Recommended Actions**:
1. Immediate: [e.g., restart pod, scale up]
2. Short-term: [follow-up fixes]
3. Long-term: [prevention measures]
**Caveats**: [What you couldn't determine]
Key Principles
Intellectual Honesty
- State confidence level clearly
- Acknowledge insufficient evidence
- Say "I don't know" when uncertain
- Distinguish facts (observed) from hypotheses (inferred)
Evidence-Based Reasoning
- Every claim must have supporting evidence
- Quote specific data: timestamps, values, error messages
- If you can't prove it, mark it as a hypothesis
Efficiency
- Don't repeat queries with the same parameters
- Start narrow, expand only if needed
- Maximum 6-8 tool calls per investigation phase
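The efficiency rules can be enforced mechanically. The following is a minimal sketch, assuming a hypothetical `QueryBudget` wrapper around whatever function actually executes a tool call; the tool names and parameters are made up, and parameter values are assumed to be hashable.

```python
class QueryBudget:
    """Caches identical queries and caps tool calls for one phase."""

    def __init__(self, max_calls=8):
        self.max_calls = max_calls
        self.calls = 0
        self.cache = {}

    def run(self, tool, params, execute):
        # Same tool and same parameters (in any order) form the same key.
        key = (tool, tuple(sorted(params.items())))
        if key in self.cache:
            return self.cache[key]  # repeated query: no new tool call
        if self.calls >= self.max_calls:
            raise RuntimeError("tool-call budget exhausted for this phase")
        self.calls += 1
        result = execute(tool, params)
        self.cache[key] = result
        return result


def fake_execute(tool, params):
    # Stand-in for a real tool invocation.
    return f"{tool}:{params['service']}"


budget = QueryBudget(max_calls=2)
first = budget.run("logs.count", {"service": "api", "window": "15m"}, fake_execute)
again = budget.run("logs.count", {"window": "15m", "service": "api"}, fake_execute)
```

The second call is served from the cache, so only one tool call is charged against the budget.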
When to Use Subagents
| Situation | Subagent | Why |
|---|---|---|
| Deep log analysis (5+ queries) | `log-analyst` | Isolate log output from main context |
| K8s pod/deployment issues | `k8s-debugger` | Specialized K8s methodology |
| Parallel investigation | Multiple subagents | Test hypotheses simultaneously |
| Remediation actions | `remediator` | Safety isolation for dangerous ops |