name: openevidence-incident-runbook description: | Execute OpenEvidence incident response procedures with triage, mitigation, and postmortem. Use when responding to OpenEvidence-related outages, investigating errors, or running post-incident reviews for clinical AI integration failures. Trigger with phrases like "openevidence incident", "openevidence outage", "openevidence down", "openevidence emergency", "clinical ai broken". allowed-tools: Read, Grep, Bash(kubectl:), Bash(curl:) version: 1.0.0 license: MIT author: Jeremy Longshore jeremy@intentsolutions.io
OpenEvidence Incident Runbook
Overview
Rapid incident response procedures for OpenEvidence clinical AI integration outages in healthcare environments.
Prerequisites
- Access to OpenEvidence status page
- kubectl access to production cluster
- Prometheus/Grafana access
- PagerDuty or on-call system access
- Communication channels (Slack, Teams)
Severity Levels
| Level | Definition | Response Time | Impact | Examples |
|---|---|---|---|---|
| P1 | Complete outage | < 15 min | Patient care affected | API unreachable, all queries failing |
| P2 | Degraded service | < 1 hour | Clinical workflows slowed | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Non-critical features | DeepConsult delays, webhook issues |
| P4 | No user impact | Next business day | Monitoring gaps | Alert noise, logging issues |
Critical Note: Clinical Impact
OpenEvidence outages may affect clinical decision-making. Always:
- Communicate clearly with clinical staff about degraded service
- Ensure fallback procedures are known (manual guidelines lookup)
- Never claim the system is available if it's degraded
- Document any clinical impact for post-incident review
Quick Triage
Step 1: Initial Assessment (2 minutes)
# 1. Check OpenEvidence status
curl -sf https://status.openevidence.com/api/v2/summary.json | jq '.status'
# 2. Check our integration health
curl -sf https://api.yourhealthcare.com/health/openevidence | jq
# 3. Check error rate (last 5 min)
curl -s 'localhost:9090/api/v1/query?query=rate(openevidence_errors_total[5m])' | jq '.data.result[0].value[1]'
# 4. Recent error logs
kubectl logs -l app=clinical-evidence-api --since=5m | grep -i error | tail -20
Step 2: Decision Tree
OpenEvidence API returning errors?
├─ YES: Is status.openevidence.com showing incident?
│ ├─ YES → OpenEvidence outage. Enable fallback. Wait for resolution.
│ └─ NO → Our integration issue. Check credentials, config, network.
└─ NO: Is our service healthy?
├─ YES → Likely resolved or intermittent. Monitor closely.
└─ NO → Our infrastructure issue. Check pods, memory, network.
Incident Response by Error Type
401/403 - Authentication Errors
# Verify API key is set
kubectl get secret openevidence-secrets -o jsonpath='{.data.api-key}' | base64 -d | wc -c
# Check if key format is correct (should start with oe_)
kubectl get secret openevidence-secrets -o jsonpath='{.data.api-key}' | base64 -d | head -c 3
# REMEDIATION: Rotate API key
# 1. Generate new key in OpenEvidence dashboard
# 2. Update secret
kubectl create secret generic openevidence-secrets \
--from-literal=api-key=NEW_KEY \
--from-literal=org-id=EXISTING_ORG_ID \
--dry-run=client -o yaml | kubectl apply -f -
# 3. Restart pods to pick up new secret
kubectl rollout restart deployment/clinical-evidence-api
429 - Rate Limited
# Check current rate limit status
curl -s -H "Authorization: Bearer ${OPENEVIDENCE_API_KEY}" \
https://api.openevidence.com/v1/rate-limit | jq
# Check rate limit metrics
curl -s 'localhost:9090/api/v1/query?query=openevidence_rate_limit_remaining' | jq
# REMEDIATION: Enable request queuing
kubectl set env deployment/clinical-evidence-api RATE_LIMIT_MODE=queue
# Long-term: Contact OpenEvidence for limit increase
# Enterprise: sales@openevidence.com
500/503 - OpenEvidence Server Errors
# Confirm it's OpenEvidence-side
curl -v -H "Authorization: Bearer ${OPENEVIDENCE_API_KEY}" \
https://api.openevidence.com/health 2>&1 | grep "< HTTP"
# REMEDIATION: Enable graceful degradation
kubectl set env deployment/clinical-evidence-api OPENEVIDENCE_FALLBACK=true
# Notify clinical staff
./scripts/notify-clinical-degraded.sh
# Update status page
./scripts/update-status-page.sh degraded "OpenEvidence service degraded"
Timeout Errors
# Check current timeout setting
kubectl get deployment clinical-evidence-api -o jsonpath='{.spec.template.spec.containers[0].env}' | jq '.[] | select(.name=="OPENEVIDENCE_TIMEOUT")'
# Check network latency to OpenEvidence
time curl -sf https://api.openevidence.com/health
# REMEDIATION: Increase timeout temporarily
kubectl set env deployment/clinical-evidence-api OPENEVIDENCE_TIMEOUT=60000
# Check for network issues
kubectl exec -it deployment/clinical-evidence-api -- nslookup api.openevidence.com
kubectl exec -it deployment/clinical-evidence-api -- traceroute api.openevidence.com
Fallback Procedures
Enable Graceful Degradation
// When OpenEvidence is unavailable, return helpful fallback
const FALLBACK_RESPONSE = {
answer: 'Clinical evidence service is temporarily unavailable. ' +
'Please consult UpToDate, DynaMed, or current clinical guidelines directly.',
citations: [],
confidence: 0,
fallback: true,
disclaimer: 'This is a fallback response. The AI clinical decision support is currently unavailable.',
};
async function clinicalQueryWithFallback(question: string) {
try {
return await openEvidence.query(question);
} catch (error) {
// Log for incident tracking
logger.error({ error, question: '[REDACTED]' }, 'OpenEvidence query failed, using fallback');
// Alert operations
await alertOps('OpenEvidence unavailable, fallback active');
return FALLBACK_RESPONSE;
}
}
Notify Clinical Staff
#!/bin/bash
# scripts/notify-clinical-degraded.sh
MESSAGE="[Clinical Alert] AI clinical decision support (OpenEvidence) is experiencing issues.
Please use alternative resources (UpToDate, DynaMed, clinical guidelines) until further notice.
Status updates: https://status.yourhealthcare.com"
# Send to clinical Slack channel
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"$MESSAGE\"}" \
"$CLINICAL_SLACK_WEBHOOK"
# Send email to clinical leadership
./scripts/send-clinical-email.sh "$MESSAGE"
Communication Templates
Internal (Slack/Teams)
:red_circle: **P1 INCIDENT: OpenEvidence Clinical AI**
**Status:** INVESTIGATING
**Started:** [timestamp]
**Impact:** Clinical queries returning errors/degraded
**Current Action:** [What you're doing]
**Fallback:** Graceful degradation enabled
**Next Update:** [time]
**Incident Commander:** @[name]
External (Clinical Staff)
**Clinical Decision Support Alert**
Our AI-powered clinical evidence tool (OpenEvidence) is currently experiencing issues.
**Impact:** Clinical queries may be slower or unavailable
**Alternative Resources:**
- UpToDate: uptodate.com
- DynaMed: dynamed.com
- Clinical guidelines directly
We are actively working to restore service and will provide updates.
Last Updated: [timestamp]
Status Page
**OpenEvidence Integration Issue**
We're experiencing issues with our clinical AI integration.
Some clinical evidence queries may be slow or unavailable.
**Impact:** [Specific impact]
**Workaround:** Use alternative clinical resources
We're actively investigating and will provide updates.
Last Updated: [timestamp]
Post-Incident Procedures
Evidence Collection
#!/bin/bash
# scripts/collect-incident-evidence.sh
INCIDENT_ID=${1:-$(date +%Y%m%d-%H%M)}
OUTPUT_DIR="incidents/${INCIDENT_ID}"
mkdir -p "$OUTPUT_DIR"
# Export logs (last 2 hours)
kubectl logs -l app=clinical-evidence-api --since=2h > "$OUTPUT_DIR/app-logs.txt"
# Export metrics
curl "localhost:9090/api/v1/query_range?query=openevidence_errors_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > "$OUTPUT_DIR/error-metrics.json"
curl "localhost:9090/api/v1/query_range?query=openevidence_request_duration_seconds&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > "$OUTPUT_DIR/latency-metrics.json"
# Export alerts
curl "localhost:9093/api/v2/alerts" > "$OUTPUT_DIR/alerts.json"
# Sanitize (remove any PHI)
./scripts/sanitize-incident-data.sh "$OUTPUT_DIR"
echo "Evidence collected in $OUTPUT_DIR"
Postmortem Template
## Incident Postmortem: OpenEvidence [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Incident Commander:** [Name]
### Summary
[1-2 sentence description of what happened]
### Clinical Impact
- Queries affected: [number]
- Users impacted: [number]
- Fallback used: Yes/No
- Any patient care impact: [description or "None identified"]
### Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | Incident declared |
| HH:MM | Fallback enabled |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |
| HH:MM | Incident closed |
### Root Cause
[Technical explanation of what caused the incident]
### Contributing Factors
- [Factor 1]
- [Factor 2]
### What Went Well
- [Positive outcome 1]
- [Positive outcome 2]
### What Could Be Improved
- [Improvement area 1]
- [Improvement area 2]
### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Preventive measure] | [Name] | [Date] | Pending |
| [Detection improvement] | [Name] | [Date] | Pending |
### Lessons Learned
[Key takeaways for the team]
Quick Reference Commands
# One-line health check
curl -sf https://api.yourhealthcare.com/health/openevidence | jq '.status' || echo "UNHEALTHY"
# Check OpenEvidence status
curl -sf https://status.openevidence.com/api/v2/summary.json | jq '.status.indicator'
# Error rate last 5 min
curl -s 'localhost:9090/api/v1/query?query=rate(openevidence_errors_total[5m])/rate(openevidence_requests_total[5m])*100' | jq '.data.result[0].value[1]'
# Enable fallback mode
kubectl set env deployment/clinical-evidence-api OPENEVIDENCE_FALLBACK=true
# Disable fallback mode (restore normal operation)
kubectl set env deployment/clinical-evidence-api OPENEVIDENCE_FALLBACK=false
# Restart pods
kubectl rollout restart deployment/clinical-evidence-api
# Watch pod status
kubectl get pods -l app=clinical-evidence-api -w
Output
- Quick triage procedure completed
- Issue identified and categorized
- Remediation applied
- Clinical staff notified
- Evidence collected for postmortem
Resources
- OpenEvidence Status
- OpenEvidence Support
- Internal: On-Call Runbook
- Internal: Escalation Matrix
Next Steps
For PHI data handling, see openevidence-data-handling.