name: gcp-architecture-discovery description: π [SRE] Discover and map GCP infrastructure architecture including compute, networking, storage, and service dependencies. metadata: author: sstawski version: 0.1.1
GCP Architecture Discovery
Overview
This skill provides a discovery approach for GCP infrastructure. It discovers what's needed for the current investigation and caches results for future use.
When to Use
- When investigating a failing service or starting an incident response.
- When an incident has been resolved and the system topology has changed (e.g., firewall blocked an IP, traffic diverted).
- When initializing a new GCP project for the first time.
- Identifying dependencies between services during incidents.
- Creating architecture diagrams for human and agent understanding.
- Auditing specific resource configurations.
Philosophy & Agent Mindset
The Architecture Graph is Your Mental Model.
The discover.json and generated PNG represent how you see the system right now. Do not hesitate to update, rewrite, or expand the graph whenever you discover something new. For example: If you realize a Load Balancer connects to a different backend than previously drawn, overwrite it immediately.
Discovery Strategy (Adaptive Blast Radius) Exhaustive scans of an entire massive project waste time, but lazily scanning just one node causes you to miss critical infrastructure changes (e.g. altered firewalls, deleted databases, changed IAM policies).
- Find the Anchor: Start at the service or node experiencing the incident.
- Map the Full Blast Radius: You MUST aggressively discover and map ALL resources functionally or topologically connected to the anchor. This includes checking:
- Upstream Load Balancers or API Gateways.
- Surrounding Network boundaries (VPCs, Subnets, Firewalls, NATs). Did any rules change?
- Downstream targets (Databases, Pub/Sub topics, Cloud Storage). Do they still exist?
π¨ CRITICAL RULE: NEVER stop after discovering just the single affected resource. You MUST forcefully map its entire surrounding ecosystem. Relying on old cache assumptions causes you to miss deleted, unmapped, or altered infrastructure.
- Update Cache: Merge the discovered ecosystem slice into the existing graph.
State Management
π¨ CRITICAL RULE: MANDATORY OUTPUT GENERATION Setting up or updating the discovery folder structure is NOT OPTIONAL. Every time this skill is triggered, you MUST ALWAYS create or update the corresponding target folder,
.json,.mdand generate the.pngvisualization accurately. Never leave the data exclusively in the chat context.
The folder structure MUST be strictly organized as follows:
{AGENT_WORKSPACE_ROOT}/discover/{gcp-project|azure-subscription}/{PROJECT_ID_OR_SUBSCRIPTION_NAME}/
Inside each project/subscription directory, maintain these explicit files:
discover.json(Structured Data Cache):- Contains a list of instances.
- Contains references/dependencies to other resources.
- Holds the raw Mermaid code inside a
"mermaid"key for rendering. - Easy for the agent to parse programmatically.
wiki.<category>.md(Topological Documentation by Resource Type):- Contains markdown documentation explaining how these pieces of architecture work together, separated into distinct files based on resource category (e.g.,
wiki.gce.mdfor VMs,wiki.vnet.mdfor networks,wiki.cloudrun.mdfor Cloud Run,wiki.sql.mdfor databases). - One file,
wiki.overview.md, MUST contain the main topology image link () and the high-level project description. - MANDATORY STRUCTURE: Each resource/app running on GCP MUST be documented under an
##(H2) header inside its respective category file. - Important sub-points (such as properties, dependencies, configurations) MUST use
###(H3) or####(H4) headers under the respective resource. - All resources/apps must explicitly list their most important properties (e.g., regions, versions, tier/scaling limits) and their dependencies.
- Easy for humans and reasoning models to read contextually.
- Contains markdown documentation explaining how these pieces of architecture work together, separated into distinct files based on resource category (e.g.,
Core Discovery Workflow
Step 1: Check Cache First
# Agent checks documentation for architecture
# Files: ./discover/{gcp-project|azure-subscription}/{PROJECT_ID_OR_NAME}/discover.json and ./discover/{gcp-project|azure-subscription}/{PROJECT_ID_OR_NAME}/wiki.*.md
Step 2: Discovery Sweep (Blast Radius)
- Query GCP to
listordescribecore resources connected to the incident. Do not rely on old cache assumptionsβif a resource is throwing errors, verify its existence, its properties, and its relations. - Identify everything in the event's blast radius: services, DBs, storage buckets, VPCs, Firewalls, and their direct upstream callers/downstream dependencies. Look out for missing, newly created, or modified resources in this radius.
Step 3: Discover Dependencies & Network Layers
- For the discovered compute services β check if they call other services (via environment variables, service mesh). Determine Serverless VPC Access connections if any.
- For the discovered GKE services β check backend services they depend on. Inspect VPC, subnets, and Network Policies.
- Discover Network & Security Layers: Identify and explicitly map ALL Virtual Networks (VPCs, Subnets), Security Perimeters (e.g., VPC Service Controls, IAP boundaries), Firewalls, Cloud NAT, and Load Balancers routing the traffic. Map all ingress/egress boundaries for your scope.
- Store this dependency map, security boundaries, and network topologies in the relevant
discover.jsonfile.
Step 4: π MANDATORY TOOL EXECUTION - Update Cache
You MUST always perform this step. Never skip saving your discoveries to disk. Do not hesitate to update the cacheβthis is your active mental model. Overwrite outdated assumptions immediately.
CRITICAL: You must physically execute the replace_string_in_file or create_file tools to update these files BEFORE responding to the user.
- Add newly discovered resources to
./discover/{gcp-project|azure-subscription}/{PROJECT_ID_OR_NAME}/discover.json. Create the file/directory if it does not exist. - Update timestamps inside
discover.json. - Update the context in the respective
./discover/{gcp-project|azure-subscription}/{PROJECT_ID_OR_NAME}/wiki.<category>.mdfiles based on resource type (e.g.,wiki.vnet.md,wiki.gce.md). Create the files if they do not exist. Ensure there are NO raw mermaid blocks inside any wiki file. - Save the raw Mermaid code inside the
"mermaid"key of./discover/{gcp-project|azure-subscription}/{PROJECT_ID_OR_NAME}/discover.json. - CRITICAL GRAPHING RULE: The Mermaid graph MUST visually encompass all topological layers. You must map EVERY resource and application into proper
subgraphstructures that represent their actual Network boundaries (Virtual Networks, VPCs, Subnets), Security Perimeters (VPC Service Controls, IAM domains), and Firewall rules (ingress/egress policies). No compute resource should "hang flat" without its network/security context. - Run
python skills/gcp-architecture-discovery/scripts/render_architecture_png.py ./discover/{gcp-project|azure-subscription}/{PROJECT_ID_OR_NAME}/discover.jsonto generate a PNG graphic of the architecture.
Step 5: Document in Session
- When working on an incident, update the respective
wiki.<category>.mdfiles to reflect the latest known good state if discrepancies were found. - Note any gaps or unknowns.
Common Patterns
Microservices on GKE
- GKE cluster β Namespaces β Deployments β Pods
- Ingress β Services β Pods
- ConfigMaps and Secrets for configuration
- Cloud SQL or external databases
Serverless Architecture
- Cloud Run services β Cloud SQL
- Cloud Functions β Pub/Sub β Cloud Storage
- API Gateway β Backend services
Hybrid Architecture
- GKE for stateful services
- Cloud Run for stateless APIs
- Cloud Functions for event processing
- Mix of GCE VMs for legacy applications
Safe Mode Support
This skill respects SAFE_MODE="enabled" and will only use read-only commands. All discovery operations are non-mutating and safe to run during incidents.
Related Skills
- gcp-mcp-setup: Setup Google Cloud MCP server. Highly recommended for agents as it provides native, structural API discovery instead of screen-scraping CLI output!
- gcp-playbooks: Use discovered architecture for mitigation
- cloud-monitoring: Query metrics for discovered resources
Tips & Best Practices
- Prefer MCP over CLI (but use CLI as fallback): If connected, use GCP MCP tools for structured API exploration instead of invoking
gcloudlocally, as it perfectly fits the agent's JSON capabilities. If MCP is missing or lacking specific access, fall back to standardgcloudcommands. - Start Narrow, Expand as Needed: Begin with the target service, then discover dependencies iteratively.
- Use
describeoverlistduring incidents**: Target specific resources by name to save time, rather than performing bulk operations. - Update Cache Incrementally: When discovering new context, merge new discoveries into
discover.json. Don't overwrite the entire cache unless performing a baseline regeneration. - Trust the Mental Model: If your
discover.jsoncache is recent, rely on it to make architectural decisions quickly during an outage. Only refresh specific resource nodes when you suspect they've undergone recent changes (e.g. recent deployments).