name: workload-identity description: Workload identity playbook (SPIFFE/SPIRE, cloud IAM, OIDC federation). Workflows for picking SPIFFE vs cloud IAM, deploying SPIRE on EKS, federating trust domains, bridging to AWS/GCP/Azure via OIDC, and migrating off long-lived workload secrets. Use when designing or hardening service-to-service auth for k8s, multi-cloud, or AI agents.
Workload Identity - Playbook
Companion rule: 318-workload-identity.mdc. This skill turns the principles into end-to-end workflows.
Scope: the infrastructure layer of identity. For the application layer (cross-app, agent-on-behalf-of-user), use skills/okta (XAA / ID-JAG) and skills/zero-trust (HITL gates, threat models).
When to invoke
Use when the user is:
- Designing service-to-service auth for a Kubernetes platform
- Choosing between SPIFFE/SPIRE and cloud IAM (IRSA, GCP Workload Identity, Azure Managed Identity)
- Deploying SPIRE on EKS / GKE / AKS (or evaluating Red Hat ZTWIM on OpenShift)
- Federating SPIFFE trust domains (multi-cluster, multi-cloud, or cross-org)
- Bridging SPIRE to cloud IAM via OIDC (AssumeRoleWithWebIdentity, GCP WIF, Azure federated credentials)
- Migrating workloads off long-lived secrets, instance profiles, or shared service accounts
- Sizing the operational SLOs for a workload-identity rollout
- Picking between SPIRE topologies (single-cluster, nested, federated, HA)
Golden Rules (anchor every decision)
- Attestation, not credentials. If the design ships a long-lived workload secret, restart.
- Trust domain per environment.
prod,staging,devare separate roots of authority. - Image-digest selectors beat image-tag selectors. Always.
- mTLS proves who. Authorization is separate. Pair with Istio AuthorizationPolicy / OPA / Lattice auth policy.
- TTL in hours for X.509-SVIDs, minutes for JWT-SVIDs. SDK auto-rotates.
- One substrate per workload, but the platform mixes both. SPIFFE for portable / k8s east-west; cloud IAM for cloud-native single-cloud.
- OIDC bridge collapses "give pods cloud creds" to "trust my SPIFFE issuer".
Workflow 1 - Pick SPIFFE vs Cloud IAM
The most common architecture decision in this space.
Steps
- Scope the choice: which workload, which trust boundaries, which clouds, which platforms.
- Score against the dimensions in references/spiffe-vs-cloud-iam.md:
- Portability needed?
- Attestation granularity required (image digest vs cloud role)?
- Operational capacity for SPIRE (HA, CA, observability)?
- Cloud-team IAM depth?
- Service-mesh strategy?
- Default rules:
- Inside one cloud, IAM-deep team, no portability need → cloud IAM (IRSA / VPC Lattice for AWS; Workload Identity for GCP; Managed Identity for Azure)
- Multi-cloud, k8s-heavy, ISV / portable platform → SPIFFE
- Most enterprises: both, segmented by workload class
- Decide and document in an ADR (don't fold this into a generic "design doc" - it's worth its own).
- Plan the OIDC bridge (Workflow 4) if SPIFFE is on one side and cloud creds on the other.
Deliverable: ADR with decision, scope, alternatives considered, and the boundary between SPIFFE and cloud-IAM-managed workloads.
See references/spiffe-vs-cloud-iam.md.
Workflow 2 - Deploy SPIRE on EKS
Production-ready SPIRE on EKS with image-digest selectors and Vault PKI as upstream CA.
Steps
- Trust domain decision. One per environment (
prod.example.com,staging.example.com,dev.example.com). Hard to change later. - SPIRE Server topology.
- HA: at least 2 replicas behind an internal NLB.
- Datastore: RDS Postgres (Multi-AZ), encrypted at rest, in a private subnet.
- UpstreamAuthority: Vault PKI (or AWS Private CA). SPIRE Server becomes an intermediate CA.
- SPIRE Agent.
- DaemonSet, runs as privileged for
/procand CRI socket access. - Node attestor:
aws_iid(EC2 instance identity document) with the cluster's OIDC issuer.
- DaemonSet, runs as privileged for
- Workload attestor.
k8splugin with namespace, service-account, and container image digest selectors.unixplugin as backup for non-k8s workloads on the node.
- Registration entries.
- Author via Terraform (
spire-controller-managerCRDs) orspire-server entry createin CI. - Selectors include
k8s:ns:<ns>,k8s:sa:<sa>,k8s:container-image:<image>@sha256:....
- Author via Terraform (
- Workload integration.
- Sidecar (Envoy + SDS) for polyglot workloads.
- Library (
go-spiffe,java-spiffe) for high-perf services. - Ambient mesh (Cilium / Istio Ambient) for encrypt-everything-east-west cases.
- Authorization on top.
- Istio AuthorizationPolicy with
source.principalmatching SPIFFE IDs. - Or Envoy ext_authz to OPA with Rego policy.
- Default-deny per service.
- Istio AuthorizationPolicy with
- Audit and monitoring.
- SPIRE Server audit log → SIEM.
- Metrics: SVID issuance latency, renewal failures, registration entry counts.
- Alerts: handshake failure spikes, agent attestation failures.
- Cutover plan.
- Phase 1: SPIRE deployed but not enforced; shadow-mode logging.
- Phase 2: Enforce per service (start with non-critical), measure error budget.
- Phase 3: Default-deny across the mesh.
- Disaster recovery.
- Datastore backup, restore runbook.
- Trust-bundle export for cross-region recovery.
- Rotation of upstream CA documented.
Deliverable: Terraform modules (SPIRE Server, Agent DaemonSet, registration), runbooks, dashboards, error-budget policy.
See references/spire-on-eks.md.
Workflow 3 - Federate Trust Domains
Multi-cluster, multi-cloud, or cross-org. Two trust domains exchange bundles so workloads on each side can verify the other's SVIDs.
Steps
- Define the federation graph. Which trust domains federate with which? Federation is bilateral; consider a hub-and-spoke vs full-mesh.
- Bundle endpoint setup. Each SPIRE Server exposes its bundle at an HTTPS endpoint, signed by a known certificate. Use a stable URL with rotation in mind.
- Bundle endpoint authentication.
https_spiffe- peer authenticates the endpoint via a SPIFFE ID (best for mature deployments).https_web- peer trusts the bundle endpoint cert via Web PKI (simpler bootstrap).
- Configure remote bundle on each Server.
spire-server bundle set -id spiffe://other.example.com -path .... - Federation policy. Mark registration entries as federated so SPIRE issues SVIDs that include the federated trust bundle.
- Authorization across federation. Istio / OPA policies must allow remote SPIFFE IDs explicitly. Don't trust by federation alone.
- Rotation and revocation. Bundles are fetched periodically (default 5 min). Plan for partial unavailability of the remote endpoint.
- Audit. Log federation events; alert on bundle fetch failures.
Deliverable: Federation topology diagram, bundle endpoint config, cross-domain authorization policies, runbook for adding/removing a federated peer.
Workflow 4 - OIDC Bridge to Cloud IAM
Trade a JWT-SVID for cloud creds. No static cloud key.
Steps
- Enable the SPIRE OIDC Discovery Provider. Exposes
.well-known/openid-configurationand the JWKS over HTTPS at a stable URL. - Per cloud:
- AWS - register the SPIRE OIDC provider in IAM (Identity Provider). Define a trust policy on the target role with
submatching the specific SPIFFE ID. Workload callsAssumeRoleWithWebIdentity. - GCP - configure Workload Identity Federation: workload identity pool + provider pointing to the SPIRE OIDC endpoint; pool member maps to a service account; workload calls the STS endpoint.
- Azure - federated credential on the User-Assigned Managed Identity referencing the SPIRE issuer + subject; workload calls the Azure AD token endpoint.
- Vault -
auth/jwtmethod with the SPIRE issuer; role binds bound claims to a Vault policy.
- AWS - register the SPIRE OIDC provider in IAM (Identity Provider). Define a trust policy on the target role with
- Trust-policy hardening. Always pin the
subclaim to a specific SPIFFE ID. Never trust the issuer alone. - Token TTLs. Cloud creds inherit the issuing service's defaults; tune to minutes.
- Audit. Log token exchanges on both sides; the cloud audit log is your record of who got what when.
- Rotation. SPIRE auto-rotates the JWT-SVID; cloud trust policy doesn't need rotation as long as the issuer URL and JWKS are stable.
Deliverable: OIDC provider config per cloud, trust policies pinned to SPIFFE IDs, audit dashboards, key rotation runbook.
Workflow 5 - Migrate Off Long-Lived Workload Secrets
The most-requested operational workflow.
Steps
- Inventory. Every workload secret in production. Sources: secrets manager, k8s Secrets, Helm values, env vars, mounted files, code.
- Classify by replacement substrate:
- Cloud-resource access from a cloud workload → cloud IAM (IRSA / Pod Identity / Workload Identity / Managed Identity). Eliminates the secret entirely.
- Service-to-service inside a cluster → SPIFFE SVIDs + mTLS, replace bearer tokens.
- Cloud access from outside the cloud (CI, on-prem) → OIDC federation (GitHub OIDC → cloud STS; SPIRE → cloud STS).
- Vendor / SaaS → OAuth client credentials with short-lived access tokens; rotate the client secret in vault.
- Migration order. Highest blast radius first (admin-scoped, multi-tenant, internet-exposed). Lowest second (analytics workloads, dev tools).
- For each workload:
- Provision the new identity primitive.
- Deploy code change to fetch creds from the new substrate.
- Verify in shadow mode (both old and new working).
- Cut over.
- Revoke the old secret.
- Confirm via audit that no caller is using the old secret.
- Decommission. Revoke unused secrets after a grace period; delete the secret store entries.
- Prevent regression.
- CI: pre-commit secret scanning (Gitleaks / TruffleHog / GitGuardian).
- PR template asks "is a new long-lived secret introduced?".
- SCPs / OPA policies block the creation of static IAM users in prod accounts.
Deliverable: Inventory + status board, per-workload migration runbook, decommission audit, regression prevention controls.
Workflow 6 - Cloudflare Access / AOP / Worker JWT Layering
Decision matrix for HTTP-edge identity. (Lifted from the broader Cloudflare context; see also 400-cloudflare.mdc.)
| Layer | Usually redundant | Still useful |
|---|---|---|
| Cloudflare Access | Re-validating the same IdP token, same issuer/audience/group already enforced by Access | Centralized edge authorization for coarse route access |
| Worker JWT validation | Repeating Access policy checks with no new claims | API-specific checks: audience, scope, tenant, route, method, fine-grained claim conditions |
| Origin validation | A second full interactive login after Access | Lightweight verification of Cf-Access-Jwt-Assertion, AOP/mTLS, service-specific authorization |
Rule of thumb: do not duplicate the same authentication decision. Add another layer only when it enforces a different boundary or finer-grained authorization.
For internal workloads (not internet-origin HTTP), this rule is the same: each layer must answer a different question, or it's not pulling its weight.
Anti-patterns (reviewers call these out)
- EC2 instance profile as the identity for high-density shared compute.
- One trust domain across environments.
- Long-lived JWT-SVIDs (defeats rotation).
- mTLS without an authorization layer ("we have mTLS so we're done").
- Identity in the cert CN instead of the URI SAN.
- Hand-distributed trust bundles instead of Workload API.
- Treating SPIFFE and cloud IAM as complements for the same workload.
- Self-signed SPIRE root in prod (no UpstreamAuthority).
- Image-tag selectors instead of digest.
- Assuming TTL is sufficient when sub-second revocation is required.
Review Output Format
[BLOCKER] <one-line summary>
Principle: <which golden rule>
Evidence: <file:line, ADR ref, or runtime evidence>
Why it matters: <blast radius / attack path>
Fix: <specific, actionable>
[IMPORTANT] <...>
[SUGGESTION] <...>
BLOCKER = violates a golden rule (e.g., long-lived workload secret introduced; mTLS without authz; selector without image digest).
IMPORTANT = misalignment with 318-workload-identity.mdc.
SUGGESTION = tightens posture / improves operability.
References
- references/spiffe-vs-cloud-iam.md - the decision matrix in depth
- references/spire-on-eks.md - SPIRE Server + Agent on EKS, Vault PKI upstream
- references/oidc-federation-bridges.md - SPIRE → AWS / GCP / Azure / Vault
Related
- Rule:
318-workload-identity.mdc(the principles) - Rule:
316-zero-trust.mdc(always-on) - Rule:
317-okta.mdc(Okta + XAA at the application layer) - Rule:
412-aws-iam.mdc(IRSA, Pod Identity, VPC Lattice) - Rule:
450-kubernetes.mdc(mesh patterns) - Skill:
zero-trust(threat models, HITL, MCP hardening) - Skill:
aws-iam(AWS IAM operational patterns) - Skill:
okta(Okta operations, XAA workflows)