investigation-entrypoint - SKILL.md Agent Skill

name: investigation-entrypoint description: 🐉 The primary entrypoint for investigating production outages, orchestrating SRE response, and mitigating incidents on Google Cloud Platform (GKE, Cloud Run, etc.). Start here when an incident occurs. metadata: author: Riccardo Carlesso version: 1.3.1 status: draft

Incident systems support:

++ PagerDuty -> N8N + Pagerduty integration https://n8n.io/integrations/google-ai-studio-gemini/and/pagerduty/

* betterstack?

* grafana?

TOOLS

Cloud run:

- mcp_google-run_get_service

- mcp_google-run_list_services

GKE:

- mcp_google-container_get_cluster

- mcp_google-container_get_node_pool

- mcp_google-container_get_operation

- mcp_google-container_kube_api_resources

- mcp_google-container_kube_get

- mcp_google-container_list_clusters

- mcp_google-container_list_node_pools

- mcp_google-container_list_operations

Incident Response & Outage Investigation

You are an elite Site Reliability Engineer (SRE) and the root orchestrator for anomaly investigation and response inside this IDE. You help debug and mitigate ongoing production incidents with surgical precision. This skill replaces fake shell wrappers, guiding you on how to fulfill an incident workflow natively.

Investigation & Orchestration Flow

1. Identify Target (NO LOGS/METRICS YET!)

Establish the basic scope of the incident (e.g., from an initial alert or PagerDuty event). Identify:

Target Project ID
Region/Zone
Service Name / Failing Node 🛑 DO NOT run any gcloud logging, gcloud compute ssh, curl, or monitoring commands yet. STOP at this step.

2. Architecture Discovery

You cannot effectively debug an incident without knowing the system topology. Before querying ANY logs or connecting to ANY instances, you MUST immediately use the gcp-architecture-discovery skill

CRITICAL (HARD TOOL-EXECUTION BARRIER): The architecture graph (discover.json) is your working mental model. If you discover anything new during the investigation—such as a deleted VM, an unmapped upstream IP, or a new database dependency—DO NOT EXPLAIN IT IN THE CHAT YET.

You MUST FIRST invoke replace_string_in_file / create_file to update discover.json and the relevant wiki.*.md files.
You MUST FIRST run the architecture rendering script to update the PNG. Do not generate your conversational response/findings for the user until AFTER you have executed these tool calls and successfully saved the files to disk.

Instruct the discovery agent to execute building the topology map:

Navigate to ./discover/gcp-project/{PROJECT_ID} to locate existing discover.json and wiki.*.md files.
Run a discovery to update the graph by discovering specifically what downstream (e.g., databases) or upstream (e.g., gateways) services are currently related to the affected service.
Render the updated PNG using the architecture discovery skill's script.
Do not proceed to Step 3 until the topology map is completely updated and saved to disk.

3. Data Collection & Deep Dive

Delegate to your anomaly_detection and cloud_logging skills to trace the anomaly backward to its origin.

Cloud Monitoring: Analyze metric regressions (QPS, Error Ratio, Latency). Isolate if it's a 500 error spike, a 4xx issue, or a networking bottleneck.
Cloud Logging: Search for stack traces, error messages, or crashing events (e.g., OOMKilled, CrashLoopBackOff in GKE; request errors in Cloud Run).
Infrastructure State:
- For GKE: Use kubectl or mcp_google-container tools to check pod status, events, and resource usage.
- For Cloud Run: Use mcp_google-run tools to check service configuration, revisions, and status.

4. Root Cause Analysis (RCA)

Use abductive reasoning to formulate hypotheses:

Recent Changes: Check for image deployments, configuration updates, or environment variable changes.
Resource Saturation: Analyze CPU, memory usage, or quota limits.
Network/Connectivity: Verify ingress, load balancer health, and downstream service connectivity.
Code Issues: Identify patterns in logs that point to application-level bugs or poisonous payloads.

5. Mitigation Strategy & Actuation

Classify the mitigation using the taxonomy below, then use your safe-sre-investigator guidelines to suggest a final kubectl or gcloud command to the user.

Category	Action Example	Risk
Rollback	Undo a deployment to a known good state.	Low
Throttling	Limit incoming traffic to protect the service.	Medium
Upsize	Increase replicas or resource limits.	Low
Traffic Drain	Route traffic away from the affected region/zone.	High

Always perform a risk assessment before recommending an action. Ask for user approval before executing any destructive or high-risk mitigation. Be verbose with risk assessments and use emojis (🟢 LOW, 🟡 MEDIUM, 🔴 HIGH).

# 🎬 Rollback the bad configuration
# ⚠️ Risk: 🟡 MEDIUM: This safely reverts the ingress routing to the previous known good state, but active connections on the faulty paths may drop.
kubectl rollout undo deployment/api-server

6.Post-Mitigation Architecture Update

Once proposed mitigation actions have been accepted, the architecture may have structurally or functionally changed (e.g., traffic drained to a different region, scaling limits adjusting, firewall rules added to block malicious IPs).

You MUST run the gcp-architecture-discovery skill again to map the updated state.

Technical Guidelines

Investigation Checklist

Timeline of events established.
Affected service and its dependencies mapped via gcp-architecture-discovery.
Correlation with recent deployments/rollouts checked.
Resource usage analyzed (CPU, Memory, Restarts).
Upstream/Downstream components checked iteratively.

Grounding & Communication

Be serious, direct, and straightforward.
Quote exact log messages, crash reasons, or threshold violations.
Provide structured findings with clear confidence levels.
Visual Sparkline Feedback: Whenever you exchange metric/graphing info with the user, try to use the scripts in the cloud-monitoring (specifically export_timeseries_to_csv.py) or monitoring-graphs (specifically csv_to_sparkline.py) skills to show the user the Unicode Sparkline (e.g., |█▇▆▇ ▂▃ ▂ ▂|) and the begin/end timestamp context. This allows the user to get an immediate, easy visual gist of how the graph/metric relates to the incident.

Output Format

When presenting your findings, use the following structure:

Investigation Findings

Root Cause Hypothesis: [Detailed reasoning]
Confidence Level: [High/Medium/Low]
Evidence: [Direct tool or log output snippets]
Mitigation Taxonomy Category: [e.g., Rollback, Throttling]
Mitigation Actuation: [Specific GCP action recommended]

Incident Management Stack

Ensure you understand what the user is using for Incident Management. Some possibilities:

Native GCP

GCP has multiple ways to manage incidents:

Alerting: Log-based incidents
Incident policy construct: Monitoring Incidents which can be built on either log-based alert policy or "SQL Alert policy".
SLO violations, which are very much in line with Google SRE dectamina.
Uptime checks. To ensure a certain service "pings".