---
name: oncall-lookup
description: |
  Look up Temporal Cloud oncall operational knowledge: alert triage, incident patterns, dashboards, commands, gotchas, and component behavior. Use this skill whenever the user asks about: oncall triage or alert handling, PagerDuty alert names or symptoms, Temporal Cloud operational issues (latency, errors, scaling, replication, visibility), what to do when paged, cell/namespace debugging, Astra/WAL/OpenSearch operational behavior, or anything related to running Temporal Cloud in production. Also use when the user mentions specific alert names like "PersistenceLatencySevereFiveMinutes" or symptoms like "shard ownership lost". Even if the user doesn't say "oncall", if their question is about Temporal Cloud operational triage, production incidents, or alert response, use this skill.
  Triggers: "oncall", "paged", "alert", "triage", "incident", "PagerDuty", "cell health", "shard ownership", "persistence latency", "visibility latency", "ResourceExhausted", "replication lag", "DLQ", "Astra", "WAL", "OpenSearch", "chronicle", "gotcha", "what do I do when", "how to handle", "on-call"
---
Oncall Knowledge Lookup
This skill gives you access to a structured knowledge base synthesized from 502 real
Temporal Cloud incidents, 20,000 #oncall-hosted-service messages spanning 14 months, and
9,114 PagerDuty alerts. The repo lives at `~/src/temporal-all/repos/oncall`.
What's available
The repo contains three tiers of information, from synthesized to raw:
Tier 1: Synthesized artifacts (start here)
oncall-knowledge-base.md (~2,300 lines) — the primary artifact. Contains:
- Symptom-to-Alert Quick Reference (lines 7-36): table mapping observed symptoms to likely alert names and which section to read next
- Universal Triage Checklist (lines 39-110): first-5-minutes procedure when paged
- Gotchas & Tribal Knowledge (lines 112-204): non-obvious failure modes organized by subsystem (auth, rate limiting, database, UI, migration, monitoring)
- Essential Tools & Dashboards (lines 207-375): Grafana URLs, CLI commands, Chronicle queries, Prometheus queries, runbook links, escalation paths
- Alert Reference Cards (lines 377-2045): ~80 individual alert types, each with frequency stats, auto-resolution rate, median resolution time, top affected cells, root causes, triage steps, and resolution actions. Organized into four groups:
  - Persistence, Database & Latency (lines 377-825)
  - Replication, Migration & Visibility (lines 825-1333)
  - Infrastructure, Scaling & Metrics (lines 1333-1845)
  - Control Plane, Canary, Workflow & Operational (lines 1845-2045)
- Symptom-Based Diagnosis (lines 2047-2172): decision trees for common symptoms like "I'm seeing high API latency", "I'm seeing ResourceExhausted errors", etc.
- Component Quick Reference (lines 2174-end): per-component summaries for Astra/Cassandra, WAL/BookKeeper, OpenSearch, Control Plane, Networking/Proxy, Metrics Pipeline
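The line ranges above lend themselves to a small lookup helper, so a section can be read without loading the whole file into context. The sketch below is illustrative only: the section keys and the `read_section` function are hypothetical names, not part of the repo, and the ranges are copied from the list above, so they will drift whenever `oncall-knowledge-base.md` is edited.

```python
from pathlib import Path

# Line ranges copied from the list above (1-based, inclusive).
# The dictionary keys are made-up labels for illustration.
SECTIONS = {
    "quick-reference": (7, 36),
    "triage-checklist": (39, 110),
    "gotchas": (112, 204),
    "tools-dashboards": (207, 375),
    "alert-cards": (377, 2045),
    "symptom-diagnosis": (2047, 2172),
    "component-reference": (2174, None),  # None means "to end of file"
}


def read_section(path, name):
    """Return the text of one named section, sliced by line number."""
    start, end = SECTIONS[name]
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Convert the 1-based inclusive range to a 0-based Python slice.
    return "\n".join(lines[start - 1 : end])
```

For example, `read_section("oncall-knowledge-base.md", "triage-checklist")` would return lines 39-110, assuming the file still matches the ranges listed here.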
Tier 2: Structured correlation data
In `correlation/`:
- `alert-stats.json`: per-alert-type statistics (total count, monthly average, median resolution time in minutes, auto-resolution rate, top affected cells)
- `symptom-map.json`: symptom keywords mapped to alert types with frequency counts
- `alert-incidents.json`: alert types mapped to formal incident channel references
These are useful for quantitative questions ("which alerts fire most often?", "what's the auto-resolution rate for X?", "which cells are most affected by Y?").
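A question like "which alerts fire most often?" amounts to sorting the stats file by one field. The sketch below assumes `alert-stats.json` is a JSON object keyed by alert name, with each value holding fields such as `total_count`. That layout and those field names are guesses for illustration; inspect the real file before relying on them.

```python
import json


def load_stats(path="correlation/alert-stats.json"):
    """Load the stats file; the default path is relative to the repo root."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_alerts(stats, key="total_count", n=5):
    """Return the n (alert_name, value) pairs with the largest `key`.

    `stats` is assumed to look like {alert_name: {field: value}};
    the real layout of alert-stats.json may differ.
    """
    ranked = sorted(
        stats.items(),
        key=lambda item: item[1].get(key, 0),  # missing fields count as 0
        reverse=True,
    )
    return [(name, fields.get(key, 0)) for name, fields in ranked[:n]]
```

Typical use would be `top_alerts(load_stats(), n=10)`, swapping `key` for whichever field the real file uses for auto-resolution rate or median resolution time.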
Tier 3: Raw and intermediate data (rarely needed)
- `incidents/*/slack.txt`: stripped incident channel transcripts
- `incidents/*/extracted.json`: per-incident LLM-extracted diagnostics, root cause, resolution
- `oncall-channel/YYYY-MM.txt`: stripped monthly oncall channel transcripts
- `oncall-channel/YYYY-MM-extracted.json`: per-month extracted alert patterns
- `pagerduty/YYYY-MM.json`: raw PagerDuty incident data
Use Tier 3 only when Tiers 1 and 2 don't answer the question — e.g., the user wants the full transcript of a specific incident, or wants to find a conversation about a particular customer or namespace that isn't covered in the synthesized material.
How to answer questions
For alert-specific questions
- Search `oncall-knowledge-base.md` for the alert name (e.g., grep for `PersistenceLatencySevereFiveMinutes`). Each alert reference card has frequency, resolution time, root causes, and triage steps.
- If you need quantitative stats beyond what's in the card, read `correlation/alert-stats.json` and look up the alert name.
- If you need to know which incidents involved this alert, check `correlation/alert-incidents.json`.
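The first step above is a plain substring search. A minimal sketch follows, in which `find_alert` is a hypothetical helper rather than something the repo provides. It returns 1-based line numbers so hits can be compared against the section ranges listed earlier; the optional `line_range` restricts the search to, say, the Alert Reference Cards.

```python
def find_alert(lines, alert_name, line_range=None):
    """Return (line_number, text) pairs for lines that mention alert_name.

    Line numbers are 1-based. If line_range=(first, last) is given,
    only lines inside that inclusive range are considered.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        if line_range and not (line_range[0] <= number <= line_range[1]):
            continue
        if alert_name in text:
            hits.append((number, text.strip()))
    return hits
```

To limit results to the reference cards, pass the file's lines (for example from `open("oncall-knowledge-base.md").read().splitlines()`) together with `line_range=(377, 2045)`.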
For symptom-based questions ("I'm seeing X")
- Check the Symptom-to-Alert Quick Reference table (top of knowledge base) to identify which alerts correspond to the symptom.
- Read the Symptom-Based Diagnosis section (lines 2047-2172) for the decision tree.
- Then look up specific alert reference cards as needed.
For "how do I triage" / "what do I do when paged"
Read the Universal Triage Checklist (lines 39-110).
For "what's the gotcha with X" / tribal knowledge
Read the Gotchas & Tribal Knowledge section (lines 112-204), which is organized by subsystem: auth, rate limiting, database, UI/frontend, migration, monitoring.
For tooling questions (dashboards, commands, queries)
Read the Essential Tools & Dashboards appendix (lines 207-375).
For component behavior questions
Read the Component Quick Reference (lines 2174-end) for Astra, WAL, OpenSearch, Control Plane, Networking, or Metrics Pipeline.
For quantitative / statistical questions
Read the JSON files in `correlation/`:
- `alert-stats.json` for per-alert frequency, resolution time, auto-resolution rate
- `symptom-map.json` for symptom-to-alert mappings with counts
- `alert-incidents.json` for alert-to-incident linkage
For deep-dive into a specific past incident
Search incidents/*/extracted.json for keywords, or grep incidents/*/slack.txt for
specific namespaces, cells, or error messages.
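The transcript search can be sketched as a case-insensitive scan over every incident directory. `search_incidents` is a hypothetical helper; the only thing taken from the repo layout described above is the `incidents/*/slack.txt` pattern.

```python
from pathlib import Path


def search_incidents(repo_root, keyword, filename="slack.txt"):
    """Yield (incident_dir, line_number, line) for case-insensitive matches.

    Scans incidents/*/<filename> under repo_root; line numbers are 1-based.
    """
    needle = keyword.lower()
    for path in sorted(Path(repo_root).glob(f"incidents/*/{filename}")):
        # errors="replace" keeps the scan going if a transcript has bad bytes.
        with path.open(encoding="utf-8", errors="replace") as f:
            for number, line in enumerate(f, start=1):
                if needle in line.lower():
                    yield path.parent.name, number, line.rstrip("\n")
```

Passing `filename="extracted.json"` runs the same scan over the extracted summaries, treating them as plain text rather than parsing the JSON.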
Important context
- Data covers Jan 2025 – Feb 2026 (14 months)
- `oncall-knowledge-base.md` supersedes the older `playbook.md` and `oncall-guide.md`
- Alert reference cards include PagerDuty statistics; these are historical frequencies and may shift as the system evolves
- The repo is at `~/src/temporal-all/repos/oncall`; all paths above are relative to it