coralogix-analysis - SKILL.md Agent Skill

name: coralogix-analysis description: Coralogix log analysis with DataPrime query language. Use when querying Coralogix logs, metrics, or traces. Provides syntax reference and intelligent investigation scripts.

Coralogix Analysis

Authentication

IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for CORALOGIX_API_KEY or other API keys in environment variables - they won't be visible to you. Just run the scripts directly; authentication is handled transparently.

Configuration environment variables you CAN check (non-secret):

CORALOGIX_DOMAIN - Team hostname (e.g., myteam.app.cx498.coralogix.com)
CORALOGIX_REGION - Region code (e.g., us2, eu1) - fallback if domain not set

Region mapping (the scripts auto-detect based on domain):

US1: *.app.coralogix.us → api.us1.coralogix.com
US2: *.app.cx498.coralogix.com → api.us2.coralogix.com
EU1: *.coralogix.com → api.eu1.coralogix.com
EU2: *.app.eu2.coralogix.com → api.eu2.coralogix.com
AP1: *.app.coralogix.in → api.ap1.coralogix.com
AP2: *.app.coralogixsg.com → api.ap2.coralogix.com

MANDATORY: Statistics-First Investigation

NEVER dump raw logs. Always follow this pattern:

STATISTICS → SAMPLE → SIGNATURES → CORRELATE

Statistics First - Know volume, error rate, and top patterns before sampling
Strategic Sampling - Choose the right strategy based on statistics
Pattern Extraction - Cluster similar errors to find root causes
Context Correlation - Investigate around anomaly timestamps

Available Scripts

All scripts are in .claude/skills/observability-coralogix/scripts/

PRIMARY INVESTIGATION SCRIPTS

get_statistics.py - ALWAYS START HERE

Comprehensive statistics with pattern extraction and anomaly detection.

python .claude/skills/observability-coralogix/scripts/get_statistics.py [--service SERVICE] [--app APP] [--time-range MINUTES]

# Examples:
python .claude/skills/observability-coralogix/scripts/get_statistics.py --time-range 60
python .claude/skills/observability-coralogix/scripts/get_statistics.py --service payment --app otel-demo

Output includes:

Total count, error count, error rate percentage
Severity distribution
Top error patterns (crucial for quick triage)
Time bucket anomalies (spike/drop detection via z-score)
Top services by log volume
Actionable recommendation

sample_logs.py - Strategic Sampling

Choose the right sampling strategy based on statistics.

python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy STRATEGY [--service SERVICE] [--app APP]

# Strategies:
#   errors_only   - Only ERROR/CRITICAL logs (default for incidents)
#   around_anomaly - Logs within time window of specific timestamp
#   first_last    - First N/2 + last N/2 logs (timeline view)
#   random        - Random sample across time range
#   all           - All severity levels (use sparingly)

# Examples:
python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy errors_only --service payment
python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy around_anomaly --timestamp "2026-01-27T05:00:00Z" --window 60
python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy first_last --service checkout --limit 50

extract_signatures.py - Pattern Clustering

Normalize and cluster log messages to see unique issue patterns.

python .claude/skills/observability-coralogix/scripts/extract_signatures.py --service SERVICE [--severity SEVERITY] [--max-signatures N]

# Examples:
python .claude/skills/observability-coralogix/scripts/extract_signatures.py --service payment --severity ERROR
python .claude/skills/observability-coralogix/scripts/extract_signatures.py --app otel-demo --max-signatures 30

Normalizes variable parts (UUIDs, IPs, timestamps, numbers) to find:

Dominant error patterns (> 50% = single root cause likely)
Diverse errors (many patterns = multiple issues)
Affected services per pattern

UTILITY SCRIPTS

list_services.py - Service Discovery

python .claude/skills/observability-coralogix/scripts/list_services.py [--time-range MINUTES]

get_health.py - Quick Health Check

python .claude/skills/observability-coralogix/scripts/get_health.py <service> [--time-range MINUTES]

get_errors.py - Quick Error Fetch

python .claude/skills/observability-coralogix/scripts/get_errors.py <service> [--app APPLICATION] [--time-range MINUTES]

query_logs.py - Raw DataPrime Queries

For custom queries not covered by other scripts.

python .claude/skills/observability-coralogix/scripts/query_logs.py "<dataprime_query>" [--time-range MINUTES] [--limit N]

DataPrime Syntax Quick Reference

Filters

# Equality (use == not =)
$l.subsystemname == 'api-server'

# Severity - use ENUM values (no quotes!)
# Valid: VERBOSE, DEBUG, INFO, WARNING, ERROR, CRITICAL
$m.severity == ERROR
$m.severity == WARNING || $m.severity == ERROR

# Text search (case-insensitive) - use ~~ not 'contains'
$d ~~ 'timeout'
$d ~~ 'connection refused'

# Combine filters with &&
$l.subsystemname == 'payment' && $m.severity == ERROR

Aggregations

# Count
| aggregate count() as total

# Group by field
| groupby $l.subsystemname aggregate count() as cnt

# Time bucketing
| timebucket 5m aggregate count() as cnt

# Multiple aggregations
| groupby $l.subsystemname aggregate count() as cnt, avg($d.duration) as avg_duration

# Order and limit
| orderby cnt desc | limit 20

Common Fields

$l.applicationname - Application/environment name (e.g., "otel-demo")
$l.subsystemname - Service name (e.g., "payment", "checkout")
$m.severity - Log level enum: VERBOSE, DEBUG, INFO, WARNING, ERROR, CRITICAL
$m.timestamp - Event timestamp
$d - Log message/data (use ~~ for text search)

Common Query Patterns

1. List all services with log counts

source logs | groupby $l.subsystemname aggregate count() as cnt | orderby cnt desc | limit 30

2. Error count by service

source logs | filter $m.severity == ERROR | groupby $l.subsystemname aggregate count() as errors | orderby errors desc

3. Error rate over time

source logs | filter $m.severity == ERROR | groupby $m.timestamp / 5m as bucket aggregate count() as errors | orderby bucket asc

4. Errors for specific service

source logs | filter $l.subsystemname == 'payment' | filter $m.severity == ERROR | limit 50

5. Search for specific error message

source logs | filter $d ~~ 'connection refused' | limit 20

Advanced DataPrime Patterns

Bracket Notation for Special Fields

K8s fields often have dots in names. Use bracket notation:

# Wrong - treats as nested path
$d.kubernetes.namespace

# Correct - literal field name with dot
$d['kubernetes.namespace']
$d['resource.attributes.k8s_pod_name']

Time-Based Comparisons

Compare logs before/after a time threshold:

# Count logs in last hour vs older
source logs | countby if($m.timestamp > now() - 1h, 'last_hour', 'older')

# Find logs older than 5 minutes
source logs | filter $m.timestamp < now() - 5m

K8s Container Restarts

Find unstable containers:

source logs
| choose resource.attributes.k8s_container_restart_count:number as restarts,
         resource.attributes.k8s_container_name as container,
         resource.attributes.k8s_deployment_name as deployment
| filter restarts > 0
| groupby deployment aggregate max(restarts) as max_restarts
| orderby max_restarts desc

Peak Error Window

Find the 10-minute window with most errors:

source logs
| filter $m.severity == ERROR
| groupby $m.timestamp / 10m as bucket aggregate count() as cnt
| orderby cnt desc
| limit 5

Fuzzy Search All Fields

When you don't know which field contains the value:

# Search all fields for text
source logs | filter $d ~~ 'connection refused'

# Or use wildfind
source logs | wildfind 'timeout'

Anti-Patterns to Avoid

❌ NEVER skip statistics - get_statistics.py is MANDATORY first step
❌ Unbounded queries - Always specify time ranges and limits
❌ Quoting severity values - Use enum: ERROR not 'ERROR'
❌ Using 'contains' - Use ~~ operator for text search
❌ Missing application filter - For multi-tenant, filter by $l.applicationname
❌ Fetching all logs - Use sampling strategies, not limit 10000
❌ Ignoring anomaly timestamps - Use around_anomaly to investigate spikes
❌ Reading logs without patterns - Always extract signatures for RCA
❌ Dot notation for K8s fields - Use bracket notation: $d['k8s.pod.name']

Investigation Workflow

Standard Incident Investigation

┌─────────────────────────────────────────────────────────────┐
│ 1. STATISTICS FIRST (mandatory)                              │
│    python get_statistics.py --service <service>              │
│    → Know volume, error rate, top patterns, anomalies        │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
                     Dominant Issue?
               ┌─────────────┴─────────────┐
               │                           │
      YES (>80% one pattern)               NO (mixed errors)
               │                           │
               ▼                           ▼
┌─────────────────────────────┐  ┌───────────────────────────────────────────┐
│ 2. FAST PATH                │  │ 2. DEEP DIVE                              │
│    Sample errors directly   │  │    python extract_signatures.py           │
│    python sample_logs.py    │  │    python sample_logs.py --strategy ...   │
│    → Verify hypothesis      │  │    → Cluster and analyze patterns         │
└─────────────────────────────┘  └───────────────────────────────────────────┘

Example: Payment Service Investigation

# Step 1: Statistics first - ALWAYS
python .claude/skills/observability-coralogix/scripts/get_statistics.py --service payment --time-range 60
# Output: 15,432 logs, 847 errors (5.5%), top pattern: "Connection timeout to downstream"

# IF dominant pattern found:
# Step 2: Verify with samples
python .claude/skills/observability-coralogix/scripts/sample_logs.py --strategy errors_only --service payment --limit 10

Quick Commands Reference

Goal	Command
Start investigation	`get_statistics.py --service X`
See error variety	`extract_signatures.py --service X`
Sample errors only	`sample_logs.py --strategy errors_only --service X`
Investigate spike	`sample_logs.py --strategy around_anomaly --timestamp T`
Timeline view	`sample_logs.py --strategy first_last --service X`
List all services	`list_services.py`
Custom query	`query_logs.py "source logs

Trace Investigation

Use traces to understand request flow and latency across services.

When to Use Traces vs Logs

Use Case	Tool
"What errors happened?"	Logs (`get_statistics.py`)
"Why is this request slow?"	Traces (`get_slow_spans.py`)
"Where did the request fail?"	Traces (`get_traces.py`)
"What's the service dependency?"	Traces (operation analysis)

Trace Scripts

get_traces.py - Find Spans

# Get spans for a service
python .claude/skills/observability-coralogix/scripts/get_traces.py --service checkout --time-range 30

# Get all spans for a trace ID
python .claude/skills/observability-coralogix/scripts/get_traces.py --trace-id abc123def456

# Filter by operation
python .claude/skills/observability-coralogix/scripts/get_traces.py --operation "/api/checkout" --service checkout

get_slow_spans.py - Latency Analysis

# Find spans slower than 500ms
python .claude/skills/observability-coralogix/scripts/get_slow_spans.py --min-duration 500

# Find slow spans in specific service
python .claude/skills/observability-coralogix/scripts/get_slow_spans.py --min-duration 200 --service checkout

# Get latency statistics by service (recommended first step)
python .claude/skills/observability-coralogix/scripts/get_slow_spans.py --stats

DataPrime Spans Syntax

Spans use source spans but with different field names than logs:

# List spans for a service (use serviceName, not $l.subsystemname)
source spans | filter serviceName == 'checkout' | limit 50

# Find slow spans (duration in MICROSECONDS)
source spans | filter duration > 500000 | orderby duration desc | limit 20

# Get all spans for a trace (use top-level traceID)
source spans | filter traceID == 'abc123def456...' | limit 100

# Latency statistics by service
source spans | groupby serviceName aggregate avg(duration) as avg_dur, max(duration) as max_dur | orderby avg_dur desc

Span Fields Reference (different from logs!)

operationName - Operation name (e.g., HTTP GET /checkout)
serviceName - Service name (equivalent to logs' $l.subsystemname)
applicationName - Application name
duration - Span duration in microseconds
traceID - Trace identifier (32-char hex)
spanID - Span identifier
parentId - Parent span ID (for trace tree)
tags - Span metadata (e.g., http.status_code, rpc.method)
process.tags - Resource attributes (e.g., k8s.pod.name)

Trace Investigation Workflow

┌─────────────────────────────────────────────────────────────┐
│ 1. CHECK LATENCY STATS                                       │
│    python get_slow_spans.py --stats                          │
│    → See which services have high latency                    │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. FIND SLOW SPANS                                           │
│    python get_slow_spans.py --min-duration 500 --service X   │
│    → Get specific slow spans with trace IDs                  │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. TRACE FULL REQUEST                                        │
│    python get_traces.py --trace-id <id>                      │
│    → See all spans in the slow request                       │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. CORRELATE WITH LOGS                                       │
│    python sample_logs.py --strategy around_anomaly           │
│    → Get logs around the same timestamp                      │
└─────────────────────────────────────────────────────────────┘