observability


Log, metric, and trace analysis methodology. Use when analyzing logs, investigating errors, querying metrics, or debugging production issues.

By incidentfox, updated 2/23/2026

name: observability
description: Log, metric, and trace analysis methodology. Use when analyzing logs, investigating errors, querying metrics, or debugging production issues.
allowed-tools: Bash(aws *, kubectl *, python *)

Observability Analysis

Core Principle: Statistics Before Samples

NEVER start by reading raw logs. Always begin with aggregated statistics:

  1. Volume: How many log events fall in the time window?
  2. Distribution: Which services, levels, and error types account for them?
  3. Trends: Is the error count increasing, stable, or decreasing?
  4. THEN sample: Get specific entries after understanding the landscape
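
The four steps above can be sketched in Python. This is a minimal illustration, not part of the original skill: it assumes log events have already been parsed into dicts with hypothetical `ts`, `service`, and `level` keys, and the 20% trend threshold is an arbitrary choice.

```python
from collections import Counter
from datetime import datetime


def summarize(events, bucket_minutes=5):
    """Summarize parsed log events before any raw entries are read.

    Each event is assumed to be a dict with 'ts' (datetime), 'service', 'level'.
    """
    by_service = Counter(e["service"] for e in events)
    by_level = Counter(e["level"] for e in events)

    # Count errors per time bucket (buckets with zero errors are simply absent).
    buckets = Counter()
    for e in events:
        if e["level"] == "ERROR":
            ts = e["ts"]
            start = ts.replace(minute=ts.minute - ts.minute % bucket_minutes,
                               second=0, microsecond=0)
            buckets[start] += 1
    series = [buckets[k] for k in sorted(buckets)]

    # Coarse trend: compare the mean of the later half with the earlier half.
    trend = "stable"
    if len(series) >= 2:
        half = len(series) // 2
        earlier = sum(series[:half]) / half
        later = sum(series[half:]) / (len(series) - half)
        if later > earlier * 1.2:  # 20% threshold is an arbitrary assumption
            trend = "increasing"
        elif later < earlier * 0.8:
            trend = "decreasing"

    return {
        "volume": len(events),
        "by_service": by_service,
        "by_level": by_level,
        "error_trend": trend,
    }


# Hypothetical sample data, for illustration only.
sample_events = [
    {"ts": datetime(2026, 1, 1, 10, 0), "service": "api", "level": "ERROR"},
    {"ts": datetime(2026, 1, 1, 10, 1), "service": "api", "level": "INFO"},
    {"ts": datetime(2026, 1, 1, 10, 5), "service": "api", "level": "ERROR"},
    {"ts": datetime(2026, 1, 1, 10, 6), "service": "db", "level": "ERROR"},
    {"ts": datetime(2026, 1, 1, 10, 7), "service": "db", "level": "ERROR"},
    {"ts": datetime(2026, 1, 1, 10, 8), "service": "db", "level": "INFO"},
]
summary = summarize(sample_events)
print(summary["volume"], dict(summary["by_level"]), summary["error_trend"])
```

Only after a summary like this points at a service and a time range is it worth pulling individual entries.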

Analysis Framework

Step 1: Get the Big Picture

  • Total log volume
  • Error rate and distribution
  • Which services are most affected

Step 2: Identify Patterns

  • Error clustering (many errors within a short time span)
  • Temporal patterns (errors began at a specific time)
  • Service correlation (Service A errors → Service B errors)

Step 3: Sample Strategically

  • Sample from error peaks
  • Get examples of each distinct error type
  • Compare against baseline period
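
One way to sketch Step 3 in Python, assuming errors are already parsed into dicts with hypothetical `bucket`, `error_type`, and `message` keys. The function names are illustrative, not part of the original skill.

```python
from collections import Counter, defaultdict


def strategic_samples(errors, per_type=1):
    """Pick examples of each distinct error type, preferring the peak bucket.

    `errors` must be non-empty; each item is assumed to be a dict with
    'bucket' (time bucket label), 'error_type', and 'message'.
    """
    counts = Counter(e["bucket"] for e in errors)
    peak_bucket, _ = counts.most_common(1)[0]

    samples = defaultdict(list)
    # Stable sort: entries from the peak bucket come first (key False < True).
    for e in sorted(errors, key=lambda e: e["bucket"] != peak_bucket):
        if len(samples[e["error_type"]]) < per_type:
            samples[e["error_type"]].append(e["message"])
    return peak_bucket, dict(samples)


def error_rate_ratio(incident_errors, incident_total,
                     baseline_errors, baseline_total):
    """How many times higher the incident error rate is than the baseline."""
    incident_rate = incident_errors / incident_total
    baseline_rate = baseline_errors / baseline_total
    return incident_rate / baseline_rate if baseline_rate else float("inf")


# Hypothetical sample data, for illustration only.
sample_errors = [
    {"bucket": "10:00", "error_type": "TimeoutException", "message": "timeout calling db"},
    {"bucket": "10:05", "error_type": "TimeoutException", "message": "timeout calling cache"},
    {"bucket": "10:05", "error_type": "NullPointerException", "message": "npe in handler"},
    {"bucket": "10:05", "error_type": "TimeoutException", "message": "timeout calling auth"},
]
peak, picked = strategic_samples(sample_errors)
print(peak, picked)
```

Comparing the incident window with a quiet period of the same length (`error_rate_ratio`) helps separate a real regression from background noise.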

CloudWatch Logs (AWS)

Insights Queries

# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)

# Top error messages (alias the count so it can be sorted on)
filter @message like /Exception/
| stats count(*) as error_count by @message
| sort error_count desc
| limit 10

# Latency percentiles (@duration is populated for Lambda logs; substitute your own duration field otherwise)
stats pct(@duration, 50) as p50, pct(@duration, 99) as p99 by bin(5m)

# Unique error types
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type

AWS CLI for CloudWatch

# List log groups
aws logs describe-log-groups --log-group-name-prefix /ecs/

# Quick error search
aws logs filter-log-events \
  --log-group-name <group> \
  --start-time <epoch-ms> \
  --filter-pattern "ERROR"

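# Insights query (start-query takes epoch seconds, unlike filter-log-events, which takes milliseconds)
aws logs start-query \
  --log-group-name <group> \
  --start-time <epoch-seconds> --end-time <epoch-seconds> \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50'

# Get query results (the query runs asynchronously; repeat until "status" is "Complete")
aws logs get-query-results --query-id <id>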
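
The start-query / get-query-results pair can also be driven from Python. The sketch below assumes `client` is a `boto3.client("logs")` object (passed in so the function itself has no hard dependency); the function name, polling interval, and poll limit are illustrative choices, not part of the original skill.

```python
import time

# Statuses after which polling should stop.
TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled", "Timeout"}


def run_insights_query(client, log_group, query, start_s, end_s,
                       poll_s=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it finishes.

    `client` is assumed to be boto3.client("logs"); `start_s` and `end_s`
    are epoch seconds. Returns one dict per result row (field -> value).
    """
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start_s,
        endTime=end_s,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        resp = client.get_query_results(queryId=query_id)
        if resp["status"] in TERMINAL_STATUSES:
            break
        time.sleep(poll_s)
    else:
        raise TimeoutError(f"query {query_id} did not finish in time")

    if resp["status"] != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {resp['status']}")

    # Each row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in resp["results"]]
```

Running an aggregate query (for example the error-rate query with `bin(5m)`) through a helper like this keeps the first look at the data statistical rather than sample-based.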

Kubernetes Logs

# Pod logs (recent)
kubectl logs <pod> -n <namespace> --tail=100

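# One pod from a deployment (kubectl picks a single pod, not all of them)
kubectl logs deployment/<name> -n <namespace> --tail=50

# All pods matching a label selector (assumes an app label), pod name prefixed to each line
kubectl logs -l app=<label> -n <namespace> --tail=50 --prefix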

# Previous container (after crash)
kubectl logs <pod> -n <namespace> --previous

# Follow logs in real-time
kubectl logs -f <pod> -n <namespace>

# Logs from the last hour (--since takes a relative duration)
kubectl logs <pod> -n <namespace> --since=1h
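
To apply "statistics before samples" to pod logs, the output of `kubectl logs` can be counted by level before anything is read line by line. A minimal sketch, assuming log lines contain a conventional level token such as `INFO` or `ERROR`; the regex and function names are hypothetical and will need adjusting for other log formats.

```python
import re
import subprocess
from collections import Counter

# Assumed level tokens; adjust for the application's actual log format.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARN(?:ING)?|ERROR|FATAL)\b")


def count_levels(lines):
    """Count log lines per level; lines without a level go to 'UNPARSED'."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.search(line)
        counts[match.group(1) if match else "UNPARSED"] += 1
    return counts


def pod_level_counts(pod, namespace, since="1h"):
    """Run `kubectl logs` for one pod and summarize its output by level."""
    result = subprocess.run(
        ["kubectl", "logs", pod, "-n", namespace, f"--since={since}"],
        capture_output=True, text=True, check=True,
    )
    return count_levels(result.stdout.splitlines())
```

A high `UNPARSED` count is itself a signal: it usually means multi-line stack traces or a log format the regex does not cover.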

Metrics Analysis

Prometheus / kubectl top

# Node resource usage (kubectl top requires the Metrics Server to be installed in the cluster)
kubectl top nodes

# Pod resource usage
kubectl top pods -n <namespace> --sort-by=cpu
kubectl top pods -n <namespace> --sort-by=memory
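
The text output of `kubectl top pods` can be turned into numbers for comparison across pods. This sketch assumes the usual three-column layout (`NAME`, `CPU(cores)`, `MEMORY(bytes)`) with values such as `250m` and `512Mi`; with extra columns (for example when listing all namespaces) the parsing would need to change. All function names are illustrative.

```python
# Conversion factors to MiB for the suffixes assumed to appear in the output.
MEM_FACTORS_MIB = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}


def parse_cpu_cores(value):
    """Convert '250m' (millicores) or '2' (cores) to a float number of cores."""
    return int(value[:-1]) / 1000 if value.endswith("m") else float(value)


def parse_memory_mib(value):
    """Convert '512Mi', '2Gi', or '1024Ki' to MiB; bare numbers are bytes."""
    for suffix, factor in MEM_FACTORS_MIB.items():
        if value.endswith(suffix):
            return float(value[:-len(suffix)]) * factor
    return float(value) / (1024 * 1024)


def parse_top_pods(text):
    """Parse `kubectl top pods` output into a list of dicts, skipping the header."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[0] == "NAME":
            continue
        rows.append({
            "pod": parts[0],
            "cpu_cores": parse_cpu_cores(parts[1]),
            "memory_mib": parse_memory_mib(parts[2]),
        })
    return rows
```

With the rows parsed, something like `max(rows, key=lambda r: r["memory_mib"])` identifies the heaviest pod without eyeballing the table.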

Key Metrics to Check

USE Method (for infrastructure):

  • Utilization: CPU, memory, disk, network bandwidth
  • Saturation: Queue depth, pending requests
  • Errors: Error counts, error rates

RED Method (for services):

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Response time distribution
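
As an illustration of the RED method, the three numbers can be computed from a list of request records. This is a sketch under stated assumptions: each record is a dict with hypothetical `status` and `duration_ms` keys, an HTTP status of 500 or above counts as an error, and percentiles use the nearest-rank method.

```python
import math


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for requests seen in one window.

    Each request is assumed to be a dict with 'status' (HTTP status code)
    and 'duration_ms'. `requests` must be non-empty.
    """
    durations = sorted(r["duration_ms"] for r in requests)
    # Assumption: 5xx responses are the errors that matter for this service.
    errors = sum(1 for r in requests if r["status"] >= 500)

    def percentile(p):
        # Nearest-rank percentile: the smallest value with at least p% at or below it.
        rank = max(1, math.ceil(p * len(durations) / 100))
        return durations[rank - 1]

    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": errors / window_seconds,
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }


# Hypothetical sample: ten requests in a 10-second window, two of them failing.
sample_requests = [
    {"status": 500 if i in (3, 7) else 200, "duration_ms": 10 * (i + 1)}
    for i in range(10)
]
red = red_summary(sample_requests, window_seconds=10)
print(red)
```

Reporting a distribution (p50 and p99) rather than a mean keeps slow outliers visible, which is the point of the Duration leg.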

Output Format

When reporting observability findings:

## Observability Analysis

### Time Window
- Start: [timestamp]
- End: [timestamp]

### Statistics
- Total logs: X events
- Error count: Y events (Z%)
- Services affected: N services
- Error rate trend: [increasing/stable/decreasing]

### Top Error Services
1. [service1]: N errors
2. [service2]: M errors

### Error Patterns
- Primary error type: [description]
- First occurrence: [timestamp]
- Correlation: [deployment/traffic/external event]

### Sample Errors
[2-3 representative error messages with context]

### Root Cause Hypothesis
[Based on patterns observed]

### Confidence Level
[High/Medium/Low with explanation]
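
The statistical part of the template above can be filled in mechanically. A minimal sketch, assuming a hypothetical `stats` dict with `start`, `end`, `total`, `errors`, `by_service`, and `trend` keys; the remaining sections (patterns, samples, hypothesis, confidence) still need human or agent judgment and are left out.

```python
def render_report(stats):
    """Render the Time Window, Statistics, and Top Error Services sections."""
    pct = 100 * stats["errors"] / stats["total"] if stats["total"] else 0.0
    lines = [
        "## Observability Analysis",
        "",
        "### Time Window",
        f"- Start: {stats['start']}",
        f"- End: {stats['end']}",
        "",
        "### Statistics",
        f"- Total logs: {stats['total']} events",
        f"- Error count: {stats['errors']} events ({pct:.1f}%)",
        f"- Services affected: {len(stats['by_service'])} services",
        f"- Error rate trend: {stats['trend']}",
        "",
        "### Top Error Services",
    ]
    # Rank services by error count, highest first.
    ranked = sorted(stats["by_service"].items(), key=lambda kv: kv[1], reverse=True)
    lines += [f"{i}. {svc}: {n} errors" for i, (svc, n) in enumerate(ranked, 1)]
    return "\n".join(lines)
```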
Install via CLI
npx skills add https://github.com/incidentfox/self-learning-ai-agent --skill observability