name: anomaly-detection
description: Detect statistical anomalies in log patterns, such as sudden error rate spikes, unusual latency, memory pressure, or cold start storms. Use when asked about anomalies, trends, spikes, or unusual behavior.
Skill: Anomaly Detection in Logs
Use this workflow to detect log patterns that deviate from an established baseline.
Step-by-Step Instructions
1. Establish Baseline
Use generate_time_report to get error counts per time bucket:
generate_time_report(log_dir="dummy_logs/", hours=24)
This returns error frequencies bucketed by hour.
2. Compute Error Rate
For each time bucket:
error_rate = error_count / total_request_count
- Baseline = median error rate over the period
- Anomaly threshold = baseline × 3 (or absolute >10% error rate)
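The baseline and threshold rules above can be sketched in Python. This is a minimal illustration, not the output format of generate_time_report: the function name `find_error_rate_anomalies` and the `(label, error_count, total_request_count)` tuple layout are assumptions made for the example.

```python
from statistics import median

def find_error_rate_anomalies(buckets, multiplier=3.0, absolute_limit=0.10):
    """Return (baseline, anomalies) for (label, error_count, total_count) tuples.

    The tuple layout is a hypothetical stand-in for the real report data.
    """
    # Skip empty buckets so we never divide by zero.
    rates = [(label, errors / total) for label, errors, total in buckets if total > 0]
    if not rates:
        return 0.0, []
    baseline = median(rate for _, rate in rates)
    threshold = baseline * multiplier
    # A bucket is anomalous if it exceeds either the relative or the absolute limit.
    anomalies = [
        (label, rate)
        for label, rate in rates
        if rate > threshold or rate > absolute_limit
    ]
    return baseline, anomalies

buckets = [("00:00", 2, 200), ("01:00", 3, 300), ("02:00", 2, 200), ("03:00", 40, 250)]
baseline, anomalies = find_error_rate_anomalies(buckets)
```

Note that when the median error rate is zero, the relative threshold is also zero, so any bucket with at least one error is flagged. Whether that is desirable depends on traffic volume.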
3. Detect Specific Anomaly Types
Error Rate Spikes
- Flag any 5-minute window with error rate > 3x the hourly baseline
- Use `grep` with a time range to investigate that window
Latency Spikes
- Search REPORT lines:
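grep(pattern="Duration: [2-9][0-9]{4}", file="...")
- This matches durations ≥ 20,000 ms (near timeout). To match only durations at or above the 25 s threshold in the classification table, use `Duration: (2[5-9]|[3-9][0-9])[0-9]{3}` instead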
- Group by 10-minute window
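The search-and-group step above could look roughly like the following Python sketch. It assumes each log line starts with an ISO-8601 timestamp (without a timezone suffix) followed by a space, which may not match your log export. The helper name `slow_invocations_by_window` is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Match the invocation duration, but not "Billed Duration" or "Init Duration".
DURATION_RE = re.compile(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms")

def slow_invocations_by_window(lines, threshold_ms=20_000, window_minutes=10):
    """Count REPORT lines at or above threshold_ms per time window.

    window_minutes should divide 60 evenly so windows align to the hour.
    """
    counts = Counter()
    for line in lines:
        if "REPORT" not in line:
            continue
        match = DURATION_RE.search(line)
        if not match or float(match.group(1)) < threshold_ms:
            continue
        # Assumed layout: "<ISO timestamp> REPORT RequestId: ..."
        ts = datetime.fromisoformat(line.split(" ", 1)[0])
        start = ts.replace(minute=ts.minute - ts.minute % window_minutes,
                           second=0, microsecond=0)
        counts[start.strftime("%Y-%m-%d %H:%M")] += 1
    return dict(counts)

sample_lines = [
    "2024-05-01T12:03:45 REPORT RequestId: a1\tDuration: 25012.33 ms\tBilled Duration: 25013 ms",
    "2024-05-01T12:07:10 REPORT RequestId: a2\tDuration: 21000.00 ms\tBilled Duration: 21000 ms",
    "2024-05-01T12:15:00 REPORT RequestId: a3\tDuration: 120.50 ms\tBilled Duration: 121 ms",
    "2024-05-01T12:21:30 REPORT RequestId: a4\tDuration: 29999.90 ms\tBilled Duration: 30000 ms",
]
slow_windows = slow_invocations_by_window(sample_lines)
```

Parsing the duration as a number rather than relying on the digit-count regex makes it easy to change the threshold, for example to 25,000 ms.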
Memory Pressure
- Search REPORT lines where `Max Memory Used` is > 90% of `Memory Size`
- Pattern: look for `Max Memory Used: (46[1-9]|4[7-9][0-9]|5[0-9]{2}) MB` when Memory Size is 512 MB (90% of 512 MB is about 461 MB; recompute the pattern for other memory sizes)
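Because a fixed regex only works for one memory size, a numeric comparison may be easier to maintain. The sketch below assumes the standard Lambda REPORT field order, with `Memory Size` immediately followed by `Max Memory Used`. The function name `memory_pressure` is made up for this example.

```python
import re

REQUEST_RE = re.compile(r"RequestId: (\S+)")
MEMORY_RE = re.compile(r"Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB")

def memory_pressure(lines, ratio=0.90):
    """Return (request_id, used_mb, size_mb) for lines above the usage ratio."""
    hits = []
    for line in lines:
        memory = MEMORY_RE.search(line)
        request = REQUEST_RE.search(line)
        if not memory or not request:
            continue
        size_mb, used_mb = int(memory.group(1)), int(memory.group(2))
        # Compare against the configured size instead of hard-coding 512 MB.
        if used_mb > ratio * size_mb:
            hits.append((request.group(1), used_mb, size_mb))
    return hits

sample_reports = [
    "REPORT RequestId: req-1\tDuration: 102.25 ms\tBilled Duration: 103 ms\tMemory Size: 512 MB\tMax Memory Used: 470 MB",
    "REPORT RequestId: req-2\tDuration: 98.10 ms\tBilled Duration: 99 ms\tMemory Size: 512 MB\tMax Memory Used: 460 MB",
    "REPORT RequestId: req-3\tDuration: 55.00 ms\tBilled Duration: 55 ms\tMemory Size: 128 MB\tMax Memory Used: 120 MB",
]
pressure_hits = memory_pressure(sample_reports)
```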
Cold Start Storms
- Search REPORT lines containing `Init Duration`
- More than 5 cold starts per minute indicates a scale-out event or deployment
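Counting cold starts per minute might be sketched as follows. As an assumption for illustration, each line is taken to begin with a `YYYY-MM-DDTHH:MM:SS` timestamp, so the first 16 characters identify the minute. The name `cold_start_storms` is hypothetical.

```python
from collections import Counter

def cold_start_storms(lines, max_per_minute=5):
    """Return {minute: count} for minutes with more cold starts than the limit."""
    per_minute = Counter()
    for line in lines:
        # Only REPORT lines for cold starts carry an "Init Duration" field.
        if "REPORT" in line and "Init Duration" in line:
            per_minute[line[:16]] += 1  # "YYYY-MM-DDTHH:MM"
    return {minute: n for minute, n in per_minute.items() if n > max_per_minute}

cold_lines = [
    f"2024-05-01T12:00:{sec:02d} REPORT RequestId: c{sec}\tDuration: 300.00 ms\tInit Duration: 450.00 ms"
    for sec in range(0, 60, 10)  # six cold starts within 12:00
] + [
    "2024-05-01T12:01:05 REPORT RequestId: d1\tDuration: 310.00 ms\tInit Duration: 440.00 ms",
    "2024-05-01T12:01:30 REPORT RequestId: d2\tDuration: 12.00 ms",  # warm start
]
storms = cold_start_storms(cold_lines)
```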
Error Cascades
- Look for one error type that starts appearing, then other error types follow within minutes
- This suggests upstream failure propagation
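One simplified way to check for a cascade is to compare when each error type first appears. The sketch below assumes errors have already been extracted as `(timestamp, error_type)` pairs. The 5-minute window is an assumed reading of "within minutes", and the minimum of three types is taken from the classification table. It looks only at first appearances and does not measure whether each type keeps growing.

```python
from datetime import datetime, timedelta

def find_cascade(events, window=timedelta(minutes=5), min_types=3):
    """Return the first group of error types that first appear within `window`."""
    first_seen = {}
    for ts, error_type in sorted(events):
        first_seen.setdefault(error_type, ts)  # keep the earliest timestamp
    ordered = sorted(first_seen.items(), key=lambda item: item[1])
    for i, (_, start) in enumerate(ordered):
        group = [etype for etype, ts in ordered[i:] if ts - start <= window]
        if len(group) >= min_types:
            return group
    return []

events = [
    (datetime(2024, 5, 1, 12, 0), "TimeoutError"),
    (datetime(2024, 5, 1, 12, 2), "ConnectionError"),
    (datetime(2024, 5, 1, 12, 3), "ValidationError"),
    (datetime(2024, 5, 1, 12, 4), "TimeoutError"),
    (datetime(2024, 5, 1, 13, 30), "KeyError"),
]
cascade = find_cascade(events)
```

The first type in the returned list is the earliest to appear, which makes it a reasonable starting point when looking for the upstream failure.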
4. Correlate Anomalies
- Do multiple anomaly types coincide? (e.g., error spike + latency spike = downstream issue)
- Check whether the anomaly aligns with a deployment time (look for a burst of cold starts, i.e. REPORT lines containing Init Duration)
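Coinciding anomalies can be found by testing time ranges for overlap. The following sketch assumes each detected anomaly is stored as a dict with `type`, `start`, and `end` keys. That structure is an illustration only and is not defined by this skill.

```python
from datetime import datetime
from itertools import combinations

def overlapping_anomalies(anomalies):
    """Return pairs of different anomaly types whose time ranges overlap."""
    pairs = []
    for a, b in combinations(anomalies, 2):
        # Two closed intervals overlap when each starts before the other ends.
        overlaps = a["start"] <= b["end"] and b["start"] <= a["end"]
        if a["type"] != b["type"] and overlaps:
            pairs.append((a["type"], b["type"]))
    return pairs

detected = [
    {"type": "error_spike", "start": datetime(2024, 5, 1, 12, 0), "end": datetime(2024, 5, 1, 12, 5)},
    {"type": "latency_spike", "start": datetime(2024, 5, 1, 12, 3), "end": datetime(2024, 5, 1, 12, 13)},
    {"type": "cold_start_storm", "start": datetime(2024, 5, 1, 15, 0), "end": datetime(2024, 5, 1, 15, 1)},
]
correlated = overlapping_anomalies(detected)
```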
5. Write Anomaly Report
Save to anomaly_report.md with:
- Timeline visualization (ASCII chart if possible)
- Each anomaly: type, time range, magnitude, affected requestIds (sample)
- Correlation analysis
- Recommended action
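For the timeline, a plain bar chart is usually enough. This is one possible rendering, assuming error rates are available as a dict of bucket label to rate. The function name `ascii_timeline` and the output layout are suggestions, not a required format.

```python
def ascii_timeline(rates, width=40):
    """Render {label: error_rate} as one bar per bucket, scaled to the peak rate."""
    peak = max(rates.values(), default=0) or 1  # avoid dividing by zero
    rows = []
    for label, rate in rates.items():
        bar = "#" * round(rate / peak * width)
        rows.append(f"{label} | {bar} {rate:.1%}")
    return "\n".join(rows)

chart = ascii_timeline({"00:00": 0.01, "01:00": 0.02, "02:00": 0.16}, width=16)
print(chart)
```

Wrapping the result in a code fence inside anomaly_report.md keeps the bars aligned when the report is rendered.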
Anomaly Classification
| Type | Threshold | Likely Cause |
|---|---|---|
| Error spike | >3x baseline in 5min | Deployment, traffic spike, dependency failure |
| Latency spike | Duration >25s (>83% of 30s limit) | Downstream timeout, cold start |
| Memory pressure | MaxMem >90% of limit | Memory leak, large payload |
| Cold start storm | >5 cold starts/min | Scale-out, deployment |
| Error cascade | 3+ error types growing simultaneously | Upstream failure |