name: observability
description: Log, metric, and trace analysis methodology. Use when analyzing logs, investigating errors, querying metrics, or debugging production issues.
allowed-tools: Bash(aws *, kubectl *, python *)
Observability Analysis
Core Principle: Statistics Before Samples
NEVER start by reading raw logs. Always begin with aggregated statistics:
- Volume: How many logs in the time window?
- Distribution: Which services/levels/error types?
- Trends: Is it increasing, stable, or decreasing?
- THEN sample: Get specific entries after understanding the landscape
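As a minimal sketch of "statistics before samples", the snippet below aggregates volume and distribution from already-parsed log records before any individual entry is inspected. The record shape (`service`, `level`) and the `summarize` helper are assumptions for illustration, not part of any tool listed here.

```python
from collections import Counter

def summarize(records):
    """Aggregate volume and distribution before any raw entries are read."""
    levels = Counter(r["level"] for r in records)
    error_services = Counter(r["service"] for r in records if r["level"] == "ERROR")
    total = len(records)
    return {
        "total": total,
        "by_level": dict(levels),
        "error_rate": levels["ERROR"] / total if total else 0.0,
        "errors_by_service": error_services.most_common(),
    }

# Hypothetical records; real ones would come from a log query.
records = [
    {"service": "checkout", "level": "ERROR"},
    {"service": "checkout", "level": "ERROR"},
    {"service": "search", "level": "ERROR"},
    {"service": "search", "level": "INFO"},
]
summary = summarize(records)
print(summary)
```

Only after a summary like this shows where errors concentrate is it worth pulling individual entries.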
Analysis Framework
Step 1: Get the Big Picture
- Total log volume
- Error rate and distribution
- Which services are most affected
Step 2: Identify Patterns
- Error clustering (many errors within a short time window)
- Temporal patterns (errors began at a specific time)
- Service correlation (Service A errors → Service B errors)
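One way to make error clustering concrete is to bucket error timestamps into fixed windows and flag buckets well above a baseline. The sketch below assumes timestamps in epoch seconds and uses the median bucket count as the baseline; the `factor` threshold and both helper names are illustrative choices, not a standard.

```python
from collections import Counter
from statistics import median

def bucket_counts(timestamps, width_s=300):
    """Count events per fixed-width bucket, keyed by bucket start (epoch seconds)."""
    return Counter(ts - ts % width_s for ts in timestamps)

def find_spikes(counts, factor=3.0):
    """Return bucket starts whose count exceeds factor x the median bucket count.

    Buckets with zero events are absent from counts, so the baseline only
    reflects windows that saw at least one error.
    """
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(b for b, n in counts.items() if n > factor * baseline)

# Hypothetical error timestamps: one error in each of three buckets, ten in the fourth.
timestamps = [0, 310, 620] + [900 + i for i in range(10)]
counts = bucket_counts(timestamps)
spikes = find_spikes(counts)
print(spikes)
```

The first spike bucket is a reasonable candidate for "started at X time" and a good place to look for a correlated deployment or upstream failure.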
Step 3: Sample Strategically
- Sample from error peaks
- Get examples of each distinct error type
- Compare against baseline period
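Strategic sampling can be sketched as: classify each message by error type, count the types, and keep only a few examples of each. The regex below mirrors the `[\w.]+Exception` pattern used in the Insights section; the `sample_by_type` helper and the sample messages are hypothetical.

```python
import re
from collections import Counter, defaultdict

ERROR_TYPE = re.compile(r"[\w.]+Exception")

def sample_by_type(messages, per_type=2):
    """Count messages per error type and keep at most per_type examples of each."""
    counts = Counter()
    samples = defaultdict(list)
    for msg in messages:
        match = ERROR_TYPE.search(msg)
        key = match.group(0) if match else "unclassified"
        counts[key] += 1
        if len(samples[key]) < per_type:
            samples[key].append(msg)
    return counts, dict(samples)

# Hypothetical messages taken from an error peak.
messages = [
    "java.net.SocketTimeoutException: read timed out",
    "java.net.SocketTimeoutException: connect timed out",
    "java.net.SocketTimeoutException: read timed out",
    "NullPointerException at OrderService.java",
    "ERROR upstream returned 502",
]
counts, samples = sample_by_type(messages)
print(counts.most_common())
```

Running the same function over a baseline window gives a direct comparison of which error types are new and which were already present.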
CloudWatch Logs (AWS)
Insights Queries
# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
# Top error messages
filter @message like /Exception/
| stats count(*) as error_count by @message
| sort error_count desc
| limit 10
# Latency percentiles (@duration is auto-discovered for Lambda log groups)
stats pct(@duration, 50) as p50, pct(@duration, 99) as p99 by bin(5m)
# Unique error types
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type
AWS CLI for CloudWatch
# List log groups
aws logs describe-log-groups --log-group-name-prefix /ecs/
# Quick error search
aws logs filter-log-events \
--log-group-name <group> \
--start-time <epoch-ms> \
--filter-pattern "ERROR"
# Insights query (times are epoch seconds, unlike filter-log-events, which takes milliseconds)
aws logs start-query \
--log-group-name <group> \
--start-time <epoch-seconds> --end-time <epoch-seconds> \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50'
# Get query results
aws logs get-query-results --query-id <id>
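Insights queries run asynchronously, so `start-query` followed by `get-query-results` usually needs a polling loop. The sketch below shows that loop in Python against a boto3-style CloudWatch Logs client passed in by the caller; `run_insights_query`, `poll_s`, and `max_polls` are illustrative names, and it assumes boto3 is installed when used against AWS.

```python
import time

# Statuses after which polling can stop.
TERMINAL = {"Complete", "Failed", "Cancelled", "Timeout"}

def run_insights_query(client, log_group, query, start_s, end_s,
                       poll_s=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it reaches a terminal status."""
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start_s,  # epoch seconds
        endTime=end_s,      # epoch seconds
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        resp = client.get_query_results(queryId=query_id)
        if resp["status"] in TERMINAL:
            break
        time.sleep(poll_s)
    else:
        raise TimeoutError(f"query {query_id} still running after {max_polls} polls")
    if resp["status"] != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {resp['status']}")
    # Each result row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row} for row in resp["results"]]
```

Against AWS, the client would be created with `boto3.client("logs")`, and the start and end times can be derived from `int(time.time())`.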
Kubernetes Logs
# Pod logs (recent)
kubectl logs <pod> -n <namespace> --tail=100
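# One pod of a deployment (kubectl picks a single pod)
kubectl logs deployment/<name> -n <namespace> --tail=50
# All pods matching a label selector, prefixed with the pod name
kubectl logs -l <label-selector> -n <namespace> --tail=50 --prefix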
# Previous container (after crash)
kubectl logs <pod> -n <namespace> --previous
# Follow logs in real-time
kubectl logs -f <pod> -n <namespace>
# Logs from specific time
kubectl logs <pod> -n <namespace> --since=1h
Metrics Analysis
Prometheus / kubectl top
# Node resource usage (kubectl top requires metrics-server in the cluster)
kubectl top nodes
# Pod resource usage
kubectl top pods -n <namespace> --sort-by=cpu
kubectl top pods -n <namespace> --sort-by=memory
Key Metrics to Check
USE Method (for infrastructure):
- Utilization: CPU, memory, disk, network bandwidth
- Saturation: Queue depth, pending requests
- Errors: Error counts, error rates
RED Method (for services):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Response time distribution
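The RED metrics can be computed directly from request records when no metrics backend is at hand. The sketch below assumes each record carries a `status` and a `duration_ms`, treats HTTP 5xx as errors, and uses nearest-rank percentiles; all of these are illustrative choices, and the helper names are hypothetical.

```python
import math

def percentile(sorted_values, q):
    """Nearest-rank percentile of a pre-sorted list; q is in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def red_summary(requests, window_s):
    """Rate, Errors, and Duration for requests observed over window_s seconds."""
    if not requests or window_s <= 0:
        return {"rate": 0.0, "errors": 0.0, "p50_ms": None, "p99_ms": None}
    durations = sorted(r["duration_ms"] for r in requests)
    failed = sum(1 for r in requests if r["status"] >= 500)
    return {
        "rate": len(requests) / window_s,  # requests per second
        "errors": failed / window_s,       # failed requests per second
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }

# Hypothetical sample: ten requests in a 10-second window, two of them failing.
requests = [
    {"status": 500 if i < 2 else 200, "duration_ms": 10 * (i + 1)}
    for i in range(10)
]
red = red_summary(requests, window_s=10)
print(red)
```

Reporting p50 alongside p99 matters because a healthy median can hide a slow tail.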
Output Format
When reporting observability findings:
## Observability Analysis
### Time Window
- Start: [timestamp]
- End: [timestamp]
### Statistics
- Total logs: X events
- Error count: Y events (Z%)
- Services affected: N services
- Error rate trend: [increasing/stable/decreasing]
### Top Error Services
1. [service1]: N errors
2. [service2]: M errors
### Error Patterns
- Primary error type: [description]
- First occurrence: [timestamp]
- Correlation: [deployment/traffic/external event]
### Sample Errors
[2-3 representative error messages with context]
### Root Cause Hypothesis
[Based on patterns observed]
### Confidence Level
[High/Medium/Low with explanation]
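The Statistics section of the report can be filled in mechanically. The sketch below computes the error percentage and labels the trend by comparing the mean error count in the first and second halves of the time buckets; that comparison and the `tolerance` threshold are one simple heuristic assumed for illustration, and both function names are hypothetical.

```python
def classify_trend(counts, tolerance=0.2):
    """Label a series of per-bucket error counts as increasing, stable, or decreasing."""
    if len(counts) < 2:
        return "stable"
    mid = len(counts) // 2
    first = sum(counts[:mid]) / mid
    second = sum(counts[mid:]) / (len(counts) - mid)
    if first == 0:
        return "increasing" if second > 0 else "stable"
    change = (second - first) / first
    if change > tolerance:
        return "increasing"
    if change < -tolerance:
        return "decreasing"
    return "stable"

def statistics_section(total, errors, services, counts):
    """Render the Statistics block of the report template."""
    pct = 100 * errors / total if total else 0.0
    return "\n".join([
        "### Statistics",
        f"- Total logs: {total} events",
        f"- Error count: {errors} events ({pct:.1f}%)",
        f"- Services affected: {services} services",
        f"- Error rate trend: {classify_trend(counts)}",
    ])

# Hypothetical numbers for illustration only.
section = statistics_section(total=2000, errors=50, services=3, counts=[1, 1, 5, 6])
print(section)
```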