name: observability description: Log, metric, and trace analysis methodology. Use when analyzing logs, investigating errors, querying metrics, or correlating signals across observability backends (Coralogix, Datadog, CloudWatch).
Observability Analysis
Core Principle: Statistics Before Samples
NEVER start by reading raw logs. Always begin with aggregated statistics:
- Volume: How many logs in the time window?
- Distribution: Which services/levels/error types?
- Trends: Is it increasing, stable, or decreasing?
- THEN sample: Get specific entries after understanding the landscape
Available Backends
IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for API keys in environment variables - they won't be there. Just use the backend scripts directly; authentication is handled transparently.
Available backends (invoke with /skill-name):
- Coralogix (DataPrime) -
/observability-coralogix - Datadog -
/observability-datadog - Honeycomb -
/observability-honeycomb - Splunk (SPL) -
/observability-splunk - Elasticsearch/OpenSearch -
/observability-elasticsearch - Jaeger (Tracing) -
/observability-jaeger
To check if a backend is working, try a simple query rather than checking env vars.
Backend-Specific Skills
- Coralogix:
/observability-coralogix- DataPrime syntax, log/trace analysis - Datadog:
/observability-datadog- DQL syntax, metrics and APM - Honeycomb:
/observability-honeycomb- High-cardinality analysis, distributed tracing - Splunk:
/observability-splunk- SPL syntax, saved searches - Elasticsearch:
/observability-elasticsearch- Lucene/Query DSL - Jaeger:
/observability-jaeger- Distributed tracing, latency analysis
Analysis Framework
Step 1: Get the Big Picture
- Total log volume
- Error rate and distribution
- Which services are most affected
Step 2: Identify Patterns
- Error clustering (many errors in short time)
- Temporal patterns (started at X time)
- Service correlation (Service A errors → Service B errors)
Step 3: Sample Strategically
- Sample from error peaks
- Get examples of each distinct error type
- Compare against baseline period
Output Format
When reporting observability findings, use this structure:
## Log Analysis Summary
### Time Window
- Start: [timestamp]
- End: [timestamp]
- Duration: X hours
### Statistics
- Total logs: X events
- Error count: Y events (Z%)
- Services affected: N services
- Error rate trend: [increasing/stable/decreasing]
### Top Error Services
1. [service1]: N errors
2. [service2]: M errors
### Error Patterns
- Primary error type: [description]
- First occurrence: [timestamp]
- Correlation: [deployment/traffic/external event]
### Sample Errors
[Quote 2-3 representative error messages with context]
### Root Cause Hypothesis
[Based on patterns observed]
### Confidence Level
[High/Medium/Low with explanation]