
name: alerting-context
description: Pull incident context from alerting platforms (PagerDuty). Use when investigating who's on-call, incident history, alert patterns, or MTTR metrics.
allowed-tools: Bash(python *)

Alerting Context

Authentication

IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for PAGERDUTY_API_KEY in environment variables - it won't be visible to you. Just run the scripts directly; authentication is handled transparently.

Why Alerting Context Matters

Before diving into logs and metrics, understand:

Has this happened before? Check similar past incidents
Who's responding? Know who's on-call and assigned
What else is alerting? Correlated alerts reveal scope
How long do similar issues take? MTTR sets expectations

Available Scripts

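All scripts are in .claude/skills/alerting-context/scripts/. Later sections abbreviate commands to the script name alone; prepend this path when running them from the repository root.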

get_incident.py - Get Incident Details

python .claude/skills/alerting-context/scripts/get_incident.py --id INCIDENT_ID [--timeline]

# Examples:
python .claude/skills/alerting-context/scripts/get_incident.py --id P123ABC
python .claude/skills/alerting-context/scripts/get_incident.py --id P123ABC --timeline

list_incidents.py - List Incidents with Filters

python .claude/skills/alerting-context/scripts/list_incidents.py [--status STATUS] [--days N] [--limit N]

# Examples:
python .claude/skills/alerting-context/scripts/list_incidents.py
python .claude/skills/alerting-context/scripts/list_incidents.py --status triggered
python .claude/skills/alerting-context/scripts/list_incidents.py --status acknowledged --limit 10
python .claude/skills/alerting-context/scripts/list_incidents.py --days 30

calculate_mttr.py - Calculate Mean Time To Resolve

python .claude/skills/alerting-context/scripts/calculate_mttr.py [--service SERVICE_ID] [--days N]

# Examples:
python .claude/skills/alerting-context/scripts/calculate_mttr.py
python .claude/skills/alerting-context/scripts/calculate_mttr.py --days 30
python .claude/skills/alerting-context/scripts/calculate_mttr.py --service PSERVICE123 --days 90

Investigation Workflow

Step 1: Get Current Incident Context

# Get details of the current incident
python get_incident.py --id P123ABC --timeline

Returns:

Incident title, status, urgency
Service affected
Who acknowledged, and when
Timeline of actions taken

Step 2: Find Similar Past Incidents

# Get incidents from the last 30 days
python list_incidents.py --days 30 --status resolved

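# Check for patterns in a specific service
# Note: --service is not listed in the list_incidents.py usage above; if the script rejects it, drop the flag and filter the output by service
python list_incidents.py --service PSERVICE123 --days 90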

Look for:

Same alert title recurring → Known issue or flapping
Cluster of alerts → Systemic problem
Low ack rate → Possible alert fatigue
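The three signals above can be checked mechanically once the incident list is available as data. The following is a minimal sketch, assuming the output of list_incidents.py has been parsed into a list of dicts; the `title` key mirrors the PagerDuty incident field, while the `acknowledged` boolean and the function name are hypothetical and should be adapted to what the script actually prints.

```python
from collections import Counter


def summarize_incidents(incidents, days):
    """Summarize recurrence and acknowledgement rate for parsed incidents.

    Assumption: each incident is a dict with a "title" key and a
    hypothetical boolean "acknowledged" key.
    """
    titles = Counter(inc["title"] for inc in incidents)
    # Titles seen more than once suggest a known issue or a flapping alert.
    recurring = {title: count for title, count in titles.items() if count > 1}
    # Average weekly firing rate per recurring title over the queried window.
    per_week = {title: count * 7 / days for title, count in recurring.items()}
    # A low share of acknowledged incidents can point to alert fatigue.
    acked = sum(1 for inc in incidents if inc.get("acknowledged"))
    ack_rate = acked / len(incidents) if incidents else 0.0
    return {"recurring": recurring, "per_week": per_week, "ack_rate": ack_rate}
```

The `per_week` value is one way to fill in the "This alert fires" line of the summary, provided `days` matches the `--days` value used for the query.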

Step 3: Check Historical MTTR

# Get MTTR for this service
python calculate_mttr.py --service PSERVICE123 --days 30

Returns:

Average MTTR (minutes/hours)
Median MTTR
95th percentile
Fastest/slowest resolution
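To make these figures concrete, here is a minimal sketch of how they could be derived from a list of resolution times in minutes. It is not the implementation of calculate_mttr.py: the function name is hypothetical, and the nearest-rank method used for the 95th percentile is an assumption, since the script's percentile method is not documented here.

```python
import statistics


def mttr_stats(durations_minutes):
    """Compute summary statistics over resolution times given in minutes."""
    if not durations_minutes:
        raise ValueError("need at least one resolved incident")
    ordered = sorted(durations_minutes)
    # Nearest-rank p95: the smallest value with at least 95% of samples at
    # or below it. Integer arithmetic avoids floating-point rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "fastest": ordered[0],
        "slowest": ordered[-1],
    }
```

With only a handful of incidents, the p95 collapses to the slowest resolution, so the median is usually the more trustworthy number for small samples.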

Quick Commands Reference

| Goal | Command |
| --- | --- |
| Get incident | `get_incident.py --id P123ABC` |
| With timeline | `get_incident.py --id P123ABC --timeline` |
| Active incidents | `list_incidents.py --status triggered` |
| Acknowledged | `list_incidents.py --status acknowledged` |
| Last 30 days | `list_incidents.py --days 30` |
| Calculate MTTR | `calculate_mttr.py --service X --days 30` |

Common Patterns

Pattern 1: "Is this a known issue?"

# Search for similar alerts in last 30 days
python list_incidents.py --days 30

# Check the output for recurring alert titles
# Look for same service, similar patterns

Pattern 2: "Escalation Investigation"

# Get full incident details with timeline
python get_incident.py --id P123ABC --timeline

# Check 'assignments' and 'acknowledgements' in output
# Timeline shows escalation events
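If the timeline is long, filtering it down to escalation-related entries makes the sequence easier to read. The sketch below assumes the `--timeline` output has been parsed into a list of dicts with `type` and `created_at` keys, using PagerDuty-style log entry type names such as `escalate_log_entry`; whether get_incident.py emits exactly this shape is an assumption, and the function name is hypothetical.

```python
def escalation_events(timeline):
    """Return (created_at, type) pairs for escalation-related entries, oldest first.

    Assumption: entries are dicts with "type" and "created_at" keys, and all
    timestamps share one ISO 8601 format and timezone so that string sorting
    matches chronological order.
    """
    wanted = {"escalate_log_entry", "assign_log_entry", "acknowledge_log_entry"}
    events = [
        (entry["created_at"], entry["type"])
        for entry in timeline
        if entry.get("type") in wanted
    ]
    return sorted(events)
```

A long gap between an assignment and the following acknowledgement, or repeated escalation entries, is usually the first thing to look for in the result.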

Pattern 3: "SLA/MTTR Tracking"

# Get MTTR for incident comparison
python calculate_mttr.py --service PSERVICE123 --days 30

# Compare current incident duration to historical average
# If current > p95, this is an unusually long incident
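The comparison against p95 is simple arithmetic on the incident's creation time. A minimal sketch follows, assuming the incident exposes an ISO 8601 `created_at` timestamp and that the p95 value is expressed in minutes; both function names are hypothetical.

```python
from datetime import datetime, timezone


def minutes_open(created_at, now=None):
    """Minutes elapsed since an ISO 8601 timestamp such as 2024-01-01T10:00:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - created).total_seconds() / 60


def is_unusually_long(created_at, p95_minutes, now=None):
    """True when the incident has been open longer than the historical p95."""
    return minutes_open(created_at, now) > p95_minutes
```

Passing `now` explicitly keeps the check reproducible when reviewing an incident after the fact.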

Output Format

## Alerting Context Summary

### Current Incident
- **ID**: [incident_id]
- **Title**: [title]
- **Status**: [triggered/acknowledged/resolved]
- **Service**: [service_name]
- **Urgency**: [high/low]
- **Created**: [timestamp]
- **Duration**: [how long since created]

### On-Call
- **Primary**: [name] ([email])
- **Secondary**: [name] ([email])
- **Escalation Policy**: [policy_name]

### Historical Context
- **Similar incidents (30d)**: N incidents with same/similar title
- **Average MTTR for this service**: X minutes
- **This alert fires**: Z times/week on average

### Recommendations
- [If recurring] Review runbook for this alert
- [If long duration] Consider escalating
- [If noisy] Consider tuning alert threshold
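As an illustration of how the template can be filled in, the sketch below renders part of the summary from two plain dicts. Every key name (`open_minutes`, `similar_count`, `p95_mttr_minutes`, and so on) is hypothetical, the On-Call section and the Created and Duration lines are omitted for brevity, and using p95 as the "long duration" threshold is an assumption borrowed from Pattern 3.

```python
def render_summary(incident, history):
    """Fill the summary template from two plain dicts (all keys are hypothetical)."""
    lines = [
        "## Alerting Context Summary",
        "",
        "### Current Incident",
        f"- **ID**: {incident['id']}",
        f"- **Title**: {incident['title']}",
        f"- **Status**: {incident['status']}",
        f"- **Service**: {incident['service']}",
        f"- **Urgency**: {incident['urgency']}",
        "",
        "### Historical Context",
        f"- **Similar incidents (30d)**: {history['similar_count']} "
        "incidents with same/similar title",
        f"- **Average MTTR for this service**: {history['avg_mttr_minutes']} minutes",
        "",
        "### Recommendations",
    ]
    # Recommendations are conditional, mirroring the bracketed hints above.
    if history["similar_count"] > 0:
        lines.append("- Review runbook for this alert")
    if incident["open_minutes"] > history["p95_mttr_minutes"]:
        lines.append("- Consider escalating")
    return "\n".join(lines)
```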

Anti-Patterns to Avoid

❌ Ignoring past incidents - Always check if it's a known issue
❌ Not checking on-call - Know who's responding before investigating
❌ Missing correlated alerts - One incident might mask the real issue
❌ Forgetting MTTR context - Know what "normal" resolution looks like
❌ Unbounded queries - Always pass a time range (for example, --days 30) to avoid timeouts
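To show what a bounded query means in practice, here is a minimal sketch that turns a day count into an explicit time window. The `since` and `until` names follow the PagerDuty REST API query parameters; that the scripts translate `--days` this way is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def time_window(days, now=None):
    """Build a bounded since/until window covering the last `days` days."""
    if days <= 0:
        raise ValueError("days must be a positive number")
    until = now or datetime.now(timezone.utc)
    since = until - timedelta(days=days)
    # ISO 8601 strings with an explicit UTC offset.
    return {"since": since.isoformat(), "until": until.isoformat()}
```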