name: gke-asm-lifecycle
description: GKE Service Mesh (Cloud Service Mesh / ASM) install, upgrade, uninstall, and verification lifecycle. Use when tasked with installing/upgrading/uninstalling ASM or CSM on a GKE cluster, verifying whether ASM is currently installed (read-only checks), running pre-uninstall safety procedures, or generating a runbook for any destructive mesh operation. Covers both managed CSM (controlplanerevision/mdp-controller) and in-cluster (istiod) paths.
GKE Service Mesh Lifecycle
When to Load This Skill
- Verifying whether ASM/CSM is installed on a GKE cluster (read-only, safe to run)
- Installing or upgrading ASM (managed or in-cluster control plane)
- Uninstalling ASM (any path) — destructive, see "Runbook-First" pattern
- Authoring a runbook for any destructive mesh operation
- Diagnosing mesh-related issues during install/upgrade/uninstall
- Reviewing mesh architecture decisions (in-cluster vs managed, sidecar vs proxyless)
Knowledge Base (User-Specific Doc Index)
The user maintains a substantial ASM knowledge base at /Users/lex/git/gcp/asm/. This skill is a navigation layer over that base — load the relevant doc directly rather than re-deriving from upstream docs.
By Lifecycle Stage
| Stage | Doc | Size | Notes |
|---|---|---|---|
| Setup / Install | google-ams-setup.md | 85 KB | Main install reference. Most complete walkthrough. |
| | How-to-setup-istio-nosidecar.md | 33 KB | No-sidecar pattern (Gateway + HTTPRoute, no istio-proxy) |
| | requirement.md | 4 KB | Pre-flight requirements |
| Version selection | asm-version.md | 5.2 KB | Version compatibility matrix |
| | asm-think.md | 3.3 KB | Reasoning notes on version choices |
| Uninstall | how-to-uninstall-service-mesh.md | 20 KB | Canonical runbook (read this first) |
| Troubleshooting | debug-gw-pod-start.md | 7.4 KB | Gateway Pod start failures |
| | how-to-resolve-grpc-config.md | 17 KB | gRPC + mesh config issues |
| | status.md | 20 KB | Current status snapshot |
| Testing | e2e-testing.md | 48 KB | E2E test suite |
| Architecture | gateways-type.md | 19 KB | Gateway type comparison |
| | summary.md / summary-gemini.md | 6 / 4 KB | High-level summaries |
| Sub-areas | dp/ | 8 files | Dataplane-specific (sidecar config) |
| | netpol/ | 5 files | Network Policy integration |
| | tls/ | 27 files | mTLS / certificate config |
| | diagram/ | 4 files | Architecture diagrams |
By Topic Sub-Area
- Sidecar injection → dp/ + How-to-setup-istio-nosidecar.md
- Network Policy → netpol/
- mTLS / certificates → tls/
- Gateway API / Ingress → gateways-type.md
Official Source-of-Truth Docs
When the user's knowledge base conflicts with upstream, upstream wins (the user updates their notes from upstream):
- Uninstall: https://docs.cloud.google.com/service-mesh/docs/uninstall (last updated 2026-06-01)
- Install (managed): https://cloud.google.com/service-mesh/docs/managed/provision-managed-control-plane
- Install (in-cluster): https://cloud.google.com/service-mesh/docs/in-cluster/install
- Upgrade: https://cloud.google.com/service-mesh/docs/upgrade
- Fleet / Membership: https://cloud.google.com/anthos/fleet-management/docs
Critical Patterns
Runbook-First Destructive Ops (MANDATORY for uninstall/upgrade)
Pattern: For any destructive mesh operation, produce the full runbook first, then ask before executing.
- Stage the runbook with verification (read-only) → pre-flight safety → main ops → verification after
- Save as markdown in the user's knowledge base (e.g., /Users/lex/git/gcp/asm/<name>.md)
- Present to user with explicit go/no-go options:
  - "Run stage 1 only (read-only verification)"
  - "Run all stages"
  - "I'll run it myself, you standby"
- Wait for explicit consent before mutating the cluster
Why: Memory may contain prior constraints (e.g., "test cluster, do not log in casually"). User may have changed their mind, but you should surface the conflict — never silently override a prior safety note.
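When a runbook is turned into a script, the explicit-consent rule can be enforced with a small gate in front of every mutating stage. The sketch below is only an illustration: the function name confirm_mutation and the exact "yes" answer are assumptions, not part of the canonical runbook.

```shell
# Hypothetical consent gate (name and prompt wording are assumptions).
# Returns 0 only when the operator types exactly "yes" on stdin.
confirm_mutation() {
  stage_label=$1
  # Prompt goes to stderr so stdout stays clean for piping.
  printf 'About to run %s. Type "yes" to continue: ' "$stage_label" >&2
  # End of input without an answer counts as "no".
  IFS= read -r answer || return 1
  [ "$answer" = "yes" ]
}

# Example use inside a runbook script (kept as a comment so nothing mutates here):
#   confirm_mutation "stage 5 (restart workloads)" || exit 1
```

Anything other than the literal answer, including an empty line or a bare "y", is treated as a refusal, which keeps the default on the safe side.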
Template structure (15-stage pattern, see how-to-uninstall-service-mesh.md for canonical example):
| Stage | Purpose | Risk |
|---|---|---|
| 0 | Pre-flight (tools, auth, variables) | None |
| 1 | Verification (read-only) — 10+ checks | Zero |
| 2 | Pre-uninstall safety (downgrade mTLS, remove AuthzPolicy) | Very low |
| 3 | Disable fleet auto-management | Per-membership, reversible |
| 4 | Disable namespace sidecar injection | None |
| 5 | Restart workloads to remove sidecars | Disruptive |
| 6 | Delete webhooks | Low |
| 7 | Delete controlplanerevision (managed only) | Medium |
| 8 | istioctl uninstall --purge (in-cluster only) | Medium |
| 9 | Delete namespaces (istio-system / asm-system) | Medium (can stick) |
| 10 | Disable fleet mesh feature (fleet-wide, irreversible) | High |
| 11 | Cleanup managed data plane + submit Support case | Medium |
| 12 | Cleanup CNI residuals (configmap + daemonset) | Low |
| 13 | Cleanup Traffic Director (snk) | Low |
| 14 | Final 8-item verification (read-only) | Zero |
| 15 | RBAC cleanup + archive runbook | Low |
Each stage must have an explicit checkpoint in the runbook execution log table.
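One way to keep the per-stage checkpoints consistent is to append a row to a markdown log table after each stage. This is a minimal sketch under assumptions: the helper name log_checkpoint and the column layout (Stage, Status, Time, Note) are hypothetical and may differ from the log table in how-to-uninstall-service-mesh.md.

```shell
# Hypothetical helper: append one checkpoint row to a markdown execution log.
# Args: log file, stage number, status (e.g. PASS / FAIL / SKIPPED), free-text note.
log_checkpoint() {
  log_file=$1
  stage=$2
  status=$3
  note=$4
  # Write the table header once, when the file is missing or empty.
  if [ ! -s "$log_file" ]; then
    printf '| Stage | Status | Time (UTC) | Note |\n|---|---|---|---|\n' > "$log_file"
  fi
  printf '| %s | %s | %s | %s |\n' \
    "$stage" "$status" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$note" >> "$log_file"
}

# Example (commented out; the path is a placeholder):
#   log_checkpoint ./uninstall-log.md 1 PASS "read-only verification complete"
```

Recording a UTC timestamp per row also gives the uninstall timestamp that a Support case asks for.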
Verify-Before-Mutate (Always)
Before any uninstall/upgrade step, always run a read-only verification pass first. The 10-item checklist:
1. kubectl get ns | grep -E 'istio-system|asm-system'
2. kubectl get pods -n istio-system
3. kubectl get crd | grep -E 'controlplanerevision|dataplanecontrol'
4. kubectl get controlplanerevision -n istio-system
5. gcloud container fleet memberships list + gcloud container hub mesh describe
6. kubectl get ns --show-labels | grep -E 'istio-injection|istio.io/rev='
7. kubectl get pods -A -o json | jq '...containers[].name | select(=="istio-proxy")...'
8. kubectl get daemonset -n kube-system | grep -E 'istio-cni|snk'
9. kubectl get configmap -n kube-system | grep istio-cni-plugin-config
10. kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep istio
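To keep a verification pass from mutating anything by accident, the commands can be routed through a guard that refuses any verb that is not clearly read-only. The sketch below is an assumption-laden illustration: is_read_only and run_check are hypothetical names, and the verb lists are a conservative guess rather than a complete catalogue of kubectl or gcloud verbs.

```shell
# Hypothetical guard: succeeds only if the command line looks read-only.
# The verb lists below are assumptions and deliberately conservative.
is_read_only() {
  # Any known mutating verb anywhere in the command rejects it.
  case " $* " in
    *" delete "*|*" apply "*|*" patch "*|*" edit "*|*" create "*|*" replace "*) return 1 ;;
    *" label "*|*" annotate "*|*" scale "*|*" rollout "*|*" uninstall "*) return 1 ;;
    *" disable "*|*" enable "*|*" update "*|*" unregister "*) return 1 ;;
  esac
  # Otherwise require an explicitly read-only verb.
  case " $* " in
    *" get "*|*" describe "*|*" list "*) return 0 ;;
  esac
  return 1
}

# Run a single command only if the guard accepts it; exit code 2 means refused.
run_check() {
  if ! is_read_only "$@"; then
    echo "REFUSED (not read-only): $*" >&2
    return 2
  fi
  "$@"
}

# Example (commented out so nothing runs on load):
#   run_check kubectl get pods -n istio-system
```

The guard checks one command's arguments, so pipelines such as the grep filters in the checklist are applied to the output of run_check, not passed into it.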
Summary decision matrix:
| Check | In-cluster | Managed | Not Installed |
|---|---|---|---|
| istio-system ns | ✅ | ✅ | ❌ |
| istiod-* Pod | ✅ | ✅ | ❌ |
| controlplanerevision CR | ❌ | ✅ | ❌ |
| mdp-controller (kube-system) | ❌ | ✅ | ❌ |
| istio.io/rev= ns labels | maybe | maybe | ❌ |
| Pods with istio-proxy | maybe | maybe | ❌ |
Managed vs In-Cluster Path Branching
The single most important question when uninstalling/upgrading: is this managed CSM or in-cluster ASM?
# Quick detector (run from bastion via IAP — see gcp-iap-tunnel skill)
kubectl get controlplanerevision -n istio-system 2>/dev/null
# Has output → managed CSM
# Empty + has istiod pods → in-cluster
# Empty + no istio-system ns → not installed
Each path has different steps:
- Managed CSM: stage 7 (controlplanerevision), stage 11 (mdp-controller + dataplanecontrol CR + Support case required)
- In-cluster: stage 8 (istioctl uninstall --purge), no Support case needed
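The three-way branch of the quick detector can be captured in a small pure function so that the result is recorded once and reused by later stages. This is a sketch, not the canonical detector: classify_mesh is a hypothetical name, and the app=istiod label selector in the commented usage is an assumption about how the istiod Pods are labelled.

```shell
# Hypothetical classifier mirroring the quick detector's three outcomes.
# Args: output of the controlplanerevision query, istiod Pod count,
#       and whether the istio-system namespace exists ("yes" or "no").
classify_mesh() {
  cpr_output=$1
  istiod_pods=$2
  ns_present=$3
  if [ -n "$cpr_output" ]; then
    echo "managed"
  elif [ "$istiod_pods" -gt 0 ]; then
    echo "in-cluster"
  elif [ "$ns_present" = "no" ]; then
    echo "not-installed"
  else
    # Namespace exists but neither signal is present: needs a human look.
    echo "unknown"
  fi
}

# Example wiring (commented out; the label selector is an assumption):
#   cpr=$(kubectl get controlplanerevision -n istio-system 2>/dev/null)
#   pods=$(kubectl get pods -n istio-system -l app=istiod --no-headers 2>/dev/null | wc -l | tr -d ' ')
#   if kubectl get ns istio-system >/dev/null 2>&1; then ns=yes; else ns=no; fi
#   classify_mesh "$cpr" "$pods" "$ns"
```

The extra "unknown" outcome is a design choice of this sketch: a leftover istio-system namespace with no control plane signals is reported instead of being forced into one of the three paths.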
Managed CSM Cleanup: Support Case OR gcloud hub mesh disable
For Managed CSM uninstall, the cloud-side cleanup is non-negotiable — without it:
- istio-system namespace gets repeatedly recreated
- Network Endpoint Groups (NEGs) are orphaned
- Uninstall falls back to "fail open" mode
There are two paths to signal "this cluster is done with CSM":
| Path | When to use | Reversible? | Speed |
|---|---|---|---|
| Google Cloud Support case (stage 11.4) | Prod fleets, multi-cluster, when fleet feature must stay enabled | Yes (Support can re-enable) | Days (Support SLA) |
| gcloud container hub mesh disable (stage 10) | Dev clusters, single-cluster fleets, when you're OK to lose fleet-level mesh feature | No (irreversible) | Immediate |
gcloud container hub mesh disable details:
# Disable the mesh feature for the entire FLEET HOST PROJECT
gcloud container hub mesh disable --project $FLEET_PROJECT_ID
# After this: gcloud container hub mesh describe returns "Service Mesh Feature is not enabled"
# This stops the Google fleet feature controller from recreating CRDs (see CRITICAL pitfall below)
When to pick which:
- Dev cluster, single fleet, no prod peers in same fleet host project → use gcloud container hub mesh disable (self-service, immediate, no human in the loop)
- Prod cluster, multi-cluster fleet, other clusters use CSM → submit Support case (or run gcloud container fleet memberships unregister <cluster> first to scope the impact, then disable)
- Mixed fleet with other prod projects → ALWAYS Support case (gcloud container hub mesh disable would affect prod peers)
The Support case must include: project ID, cluster ID, uninstall timestamp, runbook URL, desired cleanup window.
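The selection rules above reduce to one question: is this a dev cluster with no CSM peers in the same fleet host project? A minimal sketch of that decision follows; choose_cleanup_path and its two string arguments are hypothetical, and anything that is not clearly the dev-only case falls through to the Support case.

```shell
# Hypothetical decision helper for the managed CSM cleanup path.
# Args: environment kind ("dev" or "prod"), and whether other clusters or
#       prod projects in the fleet still use CSM ("yes" or "no").
choose_cleanup_path() {
  env_kind=$1
  fleet_has_csm_peers=$2
  if [ "$env_kind" = "dev" ] && [ "$fleet_has_csm_peers" = "no" ]; then
    # Self-service path: gcloud container hub mesh disable (stage 10).
    echo "mesh-disable"
  else
    # Safe default for prod, mixed, or uncertain fleets.
    echo "support-case"
  fi
}
```

Defaulting to the Support case means a mistyped or missing argument can never select the fleet-wide disable.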
Use IAP-Tunneled Bastion for Cluster Access
Cluster operations should run through a bastion via gcloud compute ssh --tunnel-through-iap. See gcp-iap-tunnel skill for:
- Output filtering (grep -v 'WARNING:.*NumPy\|setlocale')
- Region vs zone mixing
- Network troubleshooting when kubectl hangs: see "PREFERRED Fix: Re-fetch kubeconfig with --internal-ip" (this is the standard fix for managed-CSM uninstall on a private cluster with authorized networks)
- First-call latency
- Safety-block patterns to avoid (don't manually rewrite the endpoint with kubectl --server=...; use gcloud-native flags instead)
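The output-filtering item can be wrapped in a tiny function so every bastion call is filtered the same way. This sketch uses the extended-regex form of the same pattern (grep -E with a plain |) for portability; filter_noise is a hypothetical name, and $BASTION and $ZONE in the commented usage are placeholders.

```shell
# Hypothetical filter: drop the NumPy warning and setlocale noise lines.
# Same pattern as above, written as an extended regex.
filter_noise() {
  grep -Ev 'WARNING:.*NumPy|setlocale'
}

# Example (commented out; BASTION and ZONE are placeholders):
#   gcloud compute ssh "$BASTION" --zone "$ZONE" --tunnel-through-iap \
#     --command 'kubectl get ns' 2>&1 | filter_noise
```

Note that grep -v exits non-zero when every line was filtered out, so the function's exit status reflects whether any output remained, not whether the remote command succeeded.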
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Runbook executed before user approval | Cluster mutated without consent | Always runbook-first, explicit go/no-go |
| Forgot to downgrade STRICT mTLS | Apps break during uninstall | Stage 2.1 — drop PeerAuthentication to PERMISSIVE first |
| Deleted namespaces with workloads still injecting | Restart loop, sidecars stuck | Stage 5 — restart workloads BEFORE deleting ns |
| istioctl not on bastion | command not found during stage 8 | Download: curl -L https://istio.io/downloadIstio \| sh - (only for in-cluster uninstall) |
| kubectl get ns from bastion times out 90s+ | Network path broken | gcloud container clusters get-credentials --internal-ip (see gcp-iap-tunnel "PREFERRED Fix" section) |
| --zone used for regional cluster | Empty results | Use --region for regional clusters (auto-detect pattern in gke-cluster-lifecycle skill) |
| Forgot cleanup for managed CSM | istio-system / CRDs come back | Stage 10 OR Support case (see "Managed CSM Cleanup" section above) |
| cluster.x.x endpoint is private IP | Bastion can't reach it | gcloud container clusters get-credentials --internal-ip (preferred) OR add bastion NAT IP to masterAuthorizedNetworksConfig |
| 🔴 CRD auto-recreation by Google fleet controller (Managed CSM) | After deleting controlplanerevisions.mesh.cloud.google.com and dataplanecontrols.mesh.cloud.google.com, they reappear within 30s with new CREATED timestamps | Order matters: do gcloud container hub mesh disable (stage 10) BEFORE stage 11.5 (delete CRDs). Otherwise mdp-controller is gone but the higher-level Google fleet feature controller keeps re-installing them. Verify with a 30s sleep + re-check after final delete. |
| image: auto placeholder in gateway deployment | After removing injection label and rolling out, new Pod stuck ErrImagePull: image "auto" | This is expected behavior during uninstall: image: auto is a literal placeholder that the sidecar injector webhook replaces at admission time. Once injection is disabled, kubelet tries to pull the literal auto:latest and fails. Don't try to fix it; the namespace will be deleted in stage 9 anyway. The OLD pod keeps running until then, so no traffic impact. |
| Support case prompt says it requires cluster ID, but I have cluster NAME | Support form rejects "cluster name" input | Cluster ID (numeric) ≠ Cluster Name (string). Get ID with: gcloud container clusters describe $NAME --format="value(id)" |
| Tenant workloads in istio-injection=disabled namespace | Look like they need sidecar handling but don't | They have no sidecars, so the restart, mTLS, and AuthzPolicy steps are all no-ops. Verify with kubectl get pods -n <ns> -o jsonpath='{.items[*].spec.containers[*].name}'; if there is no istio-proxy, it is safe to skip stage 5. |
| Fleet membership name ≠ cluster name | gcloud container fleet mesh update --memberships <NAME> fails with "membership not found" | Membership name is what gcloud container fleet memberships list shows, often different from cluster name (e.g., cluster dev-lon-cluster-xxxxxx registered as membership aibang-master). Discover first: gcloud container fleet memberships list --project=$FLEET_PROJECT_ID |
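The tenant-workload check from the table (no istio-proxy container means stage 5 can be skipped) can be scripted by piping the jsonpath output into a small predicate. A minimal sketch, assuming the space-separated container names that the jsonpath query above prints; has_sidecar is a hypothetical name.

```shell
# Hypothetical predicate: reads space-separated container names on stdin
# and succeeds only if one of them is exactly "istio-proxy".
has_sidecar() {
  tr ' ' '\n' | grep -qx 'istio-proxy'
}

# Example (commented out; NS is a placeholder namespace):
#   if kubectl get pods -n "$NS" -o jsonpath='{.items[*].spec.containers[*].name}' | has_sidecar; then
#     echo "sidecars present: stage 5 restart is required"
#   else
#     echo "no sidecars: stage 5 can be skipped for $NS"
#   fi
```

The exact-line match (-x) avoids false positives from containers whose names merely contain the string istio-proxy.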
Stage Order Gotcha: Stage 10 Before Stage 11.5
The canonical 15-stage table lists stage 10 (disable fleet mesh) AFTER stage 11.5 (delete CRDs) because the official Google doc puts them in that order. In practice this order is wrong for managed CSM. The correct order is:
stage 11.1 delete mdp-controller
stage 11.3 delete dataplanecontrol CRs
stage 11.5 delete managed CSM CRDs ← will be RECREATED
stage 10 disable fleet mesh feature ← stop the recreator
stage 11.5 RE-delete managed CSM CRDs ← now they stay gone
Or equivalently, just do stage 10 before stage 11.5 in the first pass. The official doc order assumes a Support case will handle the cloud-side cleanup, which stops the recreator. If you use the self-service gcloud hub mesh disable path instead, do it BEFORE the final CRD delete.
Verification: after deleting CRDs the second time, sleep 30s and re-check. If they're still gone, the recreator is stopped.
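The sleep-and-re-check step can be written as a reusable function that takes the wait time and a listing command. This is a sketch under assumptions: verify_stays_gone and list_mesh_crds are hypothetical names, and the grep pattern simply matches the two CRD names mentioned in this document.

```shell
# Hypothetical check: succeed only if the lister prints nothing both
# before and after the wait. Args: delay in seconds, then the lister command.
verify_stays_gone() {
  delay_seconds=$1
  shift
  # First look: anything listed means the delete did not take effect.
  [ -z "$("$@")" ] || return 1
  sleep "$delay_seconds"
  # Second look: anything listed now was recreated during the wait.
  [ -z "$("$@")" ]
}

# Hypothetical lister for the managed CSM CRDs named in this document.
list_mesh_crds() {
  kubectl get crd -o name | grep -E 'controlplanerevisions|dataplanecontrols'
}

# Example (commented out):
#   verify_stays_gone 30 list_mesh_crds && echo "recreator is stopped"
```

Because an unreachable API server also produces empty output, this check is only meaningful once a plain kubectl get ns from the bastion succeeds.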
Reference Files
- references/2026-06-03-uninstall-session-notes.md: Real session notes from a Managed CSM uninstall on a private GKE cluster (IAP-tunneled bastion, regional, ISTIOD, nosidecar tenant). Captures the 7 most important discoveries: (1) the CRD auto-recreation trap and its root cause, (2) the gcloud get-credentials --internal-ip workflow with the authorized-networks allowlist trap, (3) the image: auto placeholder behavior during rollout, (4) nosidecar-mode = 0-disruption uninstall, (5) gcloud hub mesh disable vs Support case decision matrix, (6) the final 8-item verification with 30s sleep, (7) cross-references to companion skills.
Related Skills
- gcp-iap-tunnel — SSH to bastion via IAP (essential for cluster access)
- gke-cluster-lifecycle — Cluster version/upgrade lifecycle (adjacent concerns)
- architectrue — Production architecture partner; this skill is mesh-specific companion