gitops

name: gitops
description: "Use this skill for PittampalliOrg/stacks GitOps operations: ArgoCD app health/drift review across hub/dev/staging/ryzen; argocd-agent spoke registration/status aggregation; image promotion; release-pins, GHCR image drift, and image pins outside release-pins; SWE-bench evaluator image/env rollout and canary validation; workflow-builder spoke runtime drift; workflow-builder prompt-preset DB migrations and API smoke tests; workflow-builder MCP/auth and ActivePieces piece-runtime (MCP + activity execution) services; Dapr agent-runtime statestore, runtime-registry-driven runtime image/env pins (dapr-agent-py / claude-agent-py / browser-use-agent), sidecar readiness, and 1/2 pod recovery; GitOps Promoter stuck apps, env branches, source-hydrator, and hub promotion; Tailscale ACLs, device-backed Ingress DNS/status, ProxyGroup service-host VIPs, spoke API access, stale tailnet devices/services, and Funnel webhooks; OAuth/secret rotation, deployment inventory, workflow JSON DB upserts, and app placement."

GitOps for PittampalliOrg/stacks

Operational knowledge for the hub-and-spoke gitops system across dev, staging, ryzen (local Talos Docker, autonomous argocd-agent spoke), and hub (Talos control plane). Read this whole file, then drill into reference/ or runbooks/ based on the decision tree.

Orientation

Hub is a Talos cluster on Hetzner. It runs a single ArgoCD that manages itself and all spokes via cluster secrets.
Control plane = argocd-agent v0.8.1. The hub runs the PRINCIPAL (single pane, ns argocd); each spoke runs a LOCAL ArgoCD + an agent dialing the principal OUTBOUND over tailnet mTLS (8443). dev = MANAGED agent (hub authors Application objects in ns dev, the principal pushes them to the dev agent; observe via kubectl -n dev get applications on the hub). ryzen = AUTONOMOUS agent (reconciles its own apps; hub aggregates status). Sync OPERATIONS run on the SPOKE's local controller, so the hub pane shows sync+health but NOT operation lifecycle — "Unknown operation status" on the hub is architectural/benign. For migrated spokes the cluster-<spoke> Secret is now an AGENT MAPPING (server=...?agentName=<x> + embedded mTLS certData/keyData, NO bearerToken). See reference/architecture.md ("Spoke registration"), cluster-desired-state for the end-to-end recreate path, and cluster-desired-state/references/tailscale-and-certs.md for cert-avoidance detail.
Spokes: dev, staging (script-provisioned Talos on Hetzner — Crossplane removed in Phase D), and ryzen (imperatively-bootstrapped local Talos-in-Docker on the user's workstation). All three run a LOCAL ArgoCD + an argocd-agent dialing the hub principal. dev/staging = MANAGED (hub authors their Applications and pushes them down). ryzen = AUTONOMOUS — it runs a LOCAL ArgoCD whose root-ryzen app-of-apps reconciles packages/overlays/ryzen @ main DIRECTLY (the ryzen-* Applications live on ryzen's own cluster, NOT on the hub; the agent push-mirrors their status up to hub ns ryzen). Ryzen has no local Gitea and no idpbuilder — GitHub + GHCR only.
Hub Tekton owns the build plane:
- Outer-loop lane (github-outer-loop EL): auto-builds + promotes ALL Skaffold-owned services to dev on merge to main — NOT just workflow-builder. The hub Tekton EventListener github-outer-loop has per-service triggers (CEL filter: a commit touching services/<svc>/**, or a [build all] commit message). A merge that touches services/<svc>/ fires THAT service's trigger → the parameterized, service-agnostic outer-loop-build Pipeline builds the image → ghcr.io → the update-stacks task pins the SHARED release-pins/workflow-builder-images.yaml (one file holds EVERY service's pin) + regenerates the dev overlay packages/components/workloads/workflow-builder-system-overlays/dev/kustomization.yaml. VERIFIED end-to-end 2026-06-05 for workflow-builder, workflow-orchestrator, function-router, mcp-gateway, swebench-coordinator (the prior "only workflow-builder/swebench-inference auto-builds" belief was WRONG — the per-service triggers had simply never been exercised, so they looked dead). workflow-builder's own trigger fires on src/, lib/, scripts/, static/, drizzle/, Dockerfile, package.json changes. The renderer default is dev-only (staging dormant — stacks PR #2437 flipped WFB_RENDER_ENVS default dev staging → dev; re-enable with WFB_RENDER_ENVS="dev staging"). github-outer-loop deliberately does NOT touch ryzen's pin (ryzen is the Skaffold commit-pin lane). Current pipelines may push the release metadata commit directly to origin/main; older/alternate pipelines may open a release/workflow-builder-* release-intent PR. Inspect update-stacks logs and branch/PR state before assuming the handoff. A release metadata + generated overlay commit on origin/main drives dev.
  - update-stacks push retry now has backoff (stacks #2455). The task's git push origin main retry is 6 attempts at 4/8/12/16/20s with a rebase between (was 3 tries in ~1s with NO backoff, which DROPPED a build's promotion on a transient GitHub 500 / push contention — e.g. a build racing a concurrent merge). Transient push failures now self-heal; this was the only real gap in the per-service auto-build.
  - Bring a STALE service current without a source change. Because the per-service trigger only fires on a services/<svc>/ change, a service can stay frozen at its last successful build indefinitely. To re-pin from current main HEAD, create a PipelineRun from the outer-loop-build Pipeline with params git_url=https://github.com/PittampalliOrg/workflow-builder.git, git_sha=<current main HEAD>, image_name=<svc>, dockerfile=services/<svc>/Dockerfile, context=. (Node: function-router/mcp-gateway) or services/<svc> (Python: workflow-orchestrator/swebench-coordinator), + workspaces shared-workspace (emptyDir), dockerconfig (secret ghcr-push-credentials), buildah-cache (PVC buildah-cache-<svc>). Per-service image_name/dockerfile/context come from the outer-loop-<svc> TriggerBinding. It builds from current main → update-stacks re-pins dev. (Used 2026-06-05 to bring mcp-gateway/swebench-coordinator/function-router current.)
- Ryzen local delivery: source changes still land through the normal app repo / GitHub path, and current image delivery should use GHCR tags or explicit stacks pins. Ryzen manifest updates are delivered by committing to main — ryzen's LOCAL ArgoCD (root-ryzen @ main) re-renders packages/overlays/ryzen directly (no hub Source Hydrator, no Promoter, no env/spokes-ryzen on the ryzen lane).
- SWE-bench inference image lane: submitted by workflow-builder/swebench-coordinator preflight as swe-env-<envSpecHash-prefix> PipelineRuns on hub Tekton. These validate and pin repo/version/base images into SWEBENCH_INFERENCE_ENVIRONMENTS_DIR ConfigMap data. If a dev SWE-bench run sits queued while a swe-env-* PipelineRun exists on hub, the run is waiting for hub environment validation; do not look for Buildah pods on dev. The supported lane is the organic harness-generated image path; stale Epoch/prebuilt experiment rows or PipelineRuns must not satisfy exact-ready selection unless a fresh compatibility canary proves that strategy.
- A workflow-builder app push that touches services/dapr-agent-py normally fires three runtime image builds: dapr-agent-py-image-build, dapr-agent-py-sandbox-image-build, and dapr-agent-py-testing-sandbox-image-build. It can also fire the workflow-builder image build from the same commit. Watch the GitHub/GHCR outer-loop PipelineRuns and release metadata before deciding a rollout is complete.
- A workflow-builder app push that touches services/claude-agent-py is the Claude Agent SDK peer-runtime path. Verify the built ghcr.io/pittampalliorg/claude-agent-py-sandbox:git-<sha> tag, the workflow-builder Deployment env AGENT_RUNTIME_CLAUDE_DEFAULT_IMAGE, and the live BFF pod env before declaring Claude runtime rollout complete.
- For SWE-bench infra work, a dapr-agent-py change is not live until dev runs the matching sandbox/testing sandbox images and the BFF sees the updated AGENT_RUNTIME_*_DEFAULT_IMAGE values. A recent scoped-activity fix validated this path with workflow-builder commit 0180f081 and a clean 25-instance dev run; use the pattern, not the SHA, as the rule.
- Old spoke-local build apps such as workflow-builder-builds-local and gitea-builds-egress should stay removed. If you see them live, treat them as stale/orphaned unless a new design explicitly reintroduces them.
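The update-stacks push retry described above (6 attempts, waits of 4/8/12/16/20s, a rebase between attempts) can be sketched as follows. This is a minimal illustration of the retry shape, not the task's actual script: `push`, `rebase`, and `sleep` are hypothetical callables standing in for `git push origin main`, a rebase onto the updated remote, and a real sleep.

```python
from typing import Callable, List


def backoff_schedule(attempts: int = 6, step_seconds: int = 4) -> List[int]:
    """Waits between attempts; the defaults give 4, 8, 12, 16, 20 seconds."""
    return [step_seconds * n for n in range(1, attempts)]


def push_with_backoff(
    push: Callable[[], bool],
    rebase: Callable[[], None],
    sleep: Callable[[float], None],
    attempts: int = 6,
    step_seconds: int = 4,
) -> bool:
    """Call push() up to `attempts` times, waiting and rebasing after each failure."""
    waits = backoff_schedule(attempts, step_seconds)
    for attempt in range(attempts):
        if push():
            return True
        if attempt < len(waits):
            sleep(waits[attempt])  # linear backoff instead of an immediate retry
            rebase()  # pick up whatever landed on main in the meantime
    return False
```

With six attempts there are five waits, so a push that keeps failing gives up after roughly a minute instead of roughly a second.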
GitOps Promoter gates hub and spoke promotions. No staging cluster currently — promotion is ryzen (direct main) + dev only. workflow-builder-release gates promotion to dev through argocd-health plus the timer gate from TimedCommitStatus-workflow-builder-soak.yaml (dev 0s), autoMerge: true. The env/spokes-staging environment + its 10m soak were dropped (stacks PR #2436); staging is dormant — the cluster-staging Secret, env/*staging* branches, the staging overlay, and the spoke-workloads-appset staging entry are kept in place for fast re-enable (re-add the env to PromotionStrategy-workflow-builder-release.yaml + TimedCommitStatus-workflow-builder-soak.yaml to restore). stacks-environments gates hub self-management from env/hub-next → env/hub.
ArgoCD v3.4 + Promoter v0.30 (May 2026). Bumped from 3.3.9/v0.27.1. v3.4 has stricter ServerSideApply that surfaces operator-injected drift previously hidden — most commonly seen on Tekton Pipelines/Tasks (mutating webhook adds computeResources: {}, metadata: {}, etc.) and Knative Services (terminationGracePeriodSeconds requires a feature flag). Use ignoreDifferences with jqPathExpressions covering the operator-injected paths; see runbooks/review-argocd-app-health.md. Also enabled the web-based terminal (exec.enabled=true) — Pods now have a Terminal tab in the UI.
ArgoCD Promoter UI extension is installed on hub ArgoCD so operators can visualize PromotionStrategy, ChangeTransferPolicy, PullRequest, and related Promoter CRDs in the ArgoCD UI.
Source-hydrator renders packages/overlays/<spoke> → env/spokes-<spoke>-next; promoter merges to env/spokes-<spoke>; the hub principal pushes the generated spoke Applications down to the managed agents. This applies to dev/staging only — ryzen is NOT on the source-hydrator/Promoter path (it reconciles packages/overlays/ryzen @ main directly via its own local ArgoCD; there is no env/spokes-ryzen).
Deployment inventory is generated on the hub by gitops-deployment-inventory. Browser/human access uses the HTTPS service-host VIP gitops-inventory-hub.tail286401.ts.net; spoke workflow-builder pods use a separate node-backed Tailscale LoadBalancer gitops-inventory-hub-node.tail286401.ts.net:8080 through the in-cluster egress Service gitops-inventory-hub-egress.tailscale.svc.cluster.local:8080. The generator now lists Argo Applications cluster-wide (was ns=argocd only) so dev/staging health populates instead of showing "Unknown" (#2445). The GitOps pipeline image-history carousel needs a GITHUB_TOKEN on the workflow-builder pod (else GitHub's 60/hr unauthenticated limit empties loadPinHistory) — wired via the workflow-builder-secrets ExternalSecret reading KV ryzen-shared-secrets/GITHUB-PAT (#2444).
Promoted spoke hostnames are declarative. Dev/staging workflow-builder system URLs live in spoke-workloads-appset.yaml, tailnet exposures live under packages/base/manifests/tailscale-ingresses/ (device-backed Ingresses like phoenix-* plus the workflow-builder L4 LoadBalancer Service-workflow-builder-tailnet.yaml), and policy.hujson is reserved for tailnet policy such as real svc:* service-host approvals, device tags, Funnel grants, and Kubernetes grants.
OpenShell depends on upstream kubernetes-sigs/agent-sandbox CRDs. Keep agent-sandbox-crds / <spoke>-agent-sandbox-crds; it owns required Sandbox, SandboxClaim, SandboxTemplate, and SandboxWarmPool CRDs. The custom AgentRuntime CRD + the Kopf agent-runtime-controller are RETIRED — runtimes are now per-session agent-sandbox Sandbox pods (Kueue-admitted) selected by image, with browser-use-agent on a SandboxWarmPool carve-out. AutoKube is legacy and has been removed.
Workflow-builder MCP/auth is a DB-backed runtime path, not just static manifests. Project MCP rows in workflow-builder's mcp_connection table bind ActivePieces pieces to app_connection.external_id credentials. The activepieces-mcps app is an all-catalog reconciler (every 2 min) that provisions per-piece Knative ap-<piece>-service services from enabled mcp_connection rows + pinned pieces, plus an activepieces-mcp-catalog ConfigMap. Each ap-<piece>-service is the converged piece-runtime — ONE piece-mcp-server image parameterized by PIECE_NAME, serving POST /execute (deterministic workflow activities — fn-activepieces was deleted), POST /mcp (agent MCP tools), POST /options (canvas dropdowns), and GET /health. Credential reference-forwarding applies to BOTH /mcp tools and /execute activities (callers forward X-Connection-External-Id; the piece-runtime self-resolves via the BFF decrypt). At launch the BFF resolves the agent's MCP servers from its agentConfig.mcpServers and passes them into the per-session runtime as DAPR_AGENT_PY_BOOTSTRAP_MCP_SERVERS_JSON (claude-agent-py now consumes MCP too) — there is no per-agent CR or controller injecting them.
Sandboxed Dapr agents use centralized Dapr state. workflowstatestore is the namespace-wide workflow/actor state store for parent workflows, child session workflows, timers, reminders, and activity bookkeeping. dapr-agent-py-statestore is namespace-wide too, but actorStateStore=false; it is the agent application state API store. Do not create per-agent state stores or move durable history into pod-local state.
SWE-bench evaluator rollout is a workflow-builder image-promotion path plus an env-var check. The evaluator image is built from the workflow-builder repo (services/swebench-evaluator) and promoted through release-pins, but swebench-coordinator launches Jobs from SWEBENCH_EVALUATOR_IMAGE. The base kustomize has a replacement from local-config Pod/swebench-evaluator-image into that env var; verify the live Deployment env after promotion instead of assuming images: alone rewrote it.
Claude runtime rollout is workflow-builder image-promotion plus a BFF env-var check. claude-agent-py-sandbox is surfaced in the workflow-builder GitOps pipeline UI service matrix and is consumed through AGENT_RUNTIME_CLAUDE_DEFAULT_IMAGE, not by the workspace sandboxTemplate name. Also verify CLAUDE_AGENT_PY_DEFAULT_MODEL=claude-opus-4-8 and the workflow-builder model key anthropic/claude-opus-4-8 when changing defaults.

The two image-pin systems for the same workflow-builder base are the most common source of confusion. Read reference/architecture.md first if you've never seen this setup.

Event-driven activity stream (Argo Events → workflow-builder GitOps pipeline UI)

The whole delivery system is observable LIVE in workflow-builder at /admin/gitops/system (the "Kargo lens" pipeline view), fed by Argo Events on the hub. This is the fastest way to watch a build/promotion/sync land across ryzen + dev in real time.

Producers — stacks packages/components/hub-management/manifests/gitops-activity-events/ (ArgoCD app apps/gitops-activity-events.yaml): an Argo Events EventBus + three resource EventSources watch hub CRs — ArgoCD Applications (EventSource-gitops-argocd), Tekton PipelineRun/TaskRuns (-tekton), GitOps-Promoter CRs (-promoter). Matching Sensors (Sensor-gitops-{argocd,tekton,promoter} + -inventory-refresh) fire on each change and HTTP-POST a normalized event to workflow-builder's internal ingest endpoint. The argo-events-ui manifests expose the raw Argo Events dashboard.
Ingest — workflow-builder BFF: POST /api/internal/gitops/events/ingest → src/lib/server/gitops/activity-events.ts classifies source (tekton/promoter/argocd), normalizes resourceRef, extracts a correlation map (imageName, imageRef, gitSha, argocdApp, syncRevision/syncStatus/healthStatus, branch, hydratedSha, drySha, PR, commitStatusKey…), computes a deterministic eventId, and appends to the gitops_activity_events table with a monotonic sequence.
Consumer — the UI: /admin/gitops/system opens an SSE stream GET /api/v1/gitops/events/stream?since=<seq> (sequence-resume + exponential-backoff reconnect + fallback poll only while disconnected). src/lib/gitops/activity-overlay.ts CORRELATES events onto the pipeline model — Tekton by imageName/imageRef/gitSha, Promoter → the release-pins bundle stage by branch/hydratedSha, ArgoCD by <env>-<app> name. The overlay attaches activity ONLY; it never mutates the authoritative inventory (health/sync) snapshot. Graph nodes / list / per-node drawer render event-first (failed/passing/active/neutral tone from one shared activity-tone.ts, freshness from a shared clock, a brief node + edge "flow" pulse on each incoming batch); the raw firehose is behind ?debug=1.
Reading it during a deploy: the header build <sha> badge shows the image THAT cluster's UI pod is itself running, so it doubles as the per-cluster delivery proof — build <ryzen-sha> on workflow-builder-ryzen.tail286401.ts.net, the promoted sha on the dev URL. Inventory/health remains the authoritative recovery snapshot; the event stream is the live overlay on top of it.
Build feedback + delivery timeline (INVENTORY-sourced, NOT event-sourced — 2026-06-05). The pipeline view also surfaces image-build status and the whole Commit→Build→Pin→Promote→Deploy chain, but it reads these from the hub inventory, not the Argo-Events stream. Why: the activity stream is ~100% ArgoCD (Tekton events are buried out of any practical window), whereas gitops-deployment-inventory already aggregates environments[].applications[].build = {pipelineRun, status, reason, startedAt, finishedAt} (+ desired/promotion/live/drift) per app, refreshed ~15s. So the model threads build (→ a build chip on each stage card: Built/Building/Failed + duration + a Tekton PipelineRun deep-link to ns tekton-pipelines) and imageHistory provenance (commit/pin) — no Tekton TaskRun triggers were added; the Tekton EventSource is unchanged. The node drawer's Delivery timeline shows inter-step gaps (↓ +1m), phase durations (build, soak), a commit→live lead-time header, and a single "live since" on Deploy — a per-row "N mins ago" would collapse to one value because the automated outer-loop runs as one sub-minute burst (durations/gaps carry the signal instead). Lane-aware Promote: dev shows the promoted hydrated sha + soak/gate; ryzen shows "direct to main · no Promoter gate" (the pin IS the promotion). Data quirk to know: imageHistory.committedAt is actually the pin-commit time (from the stacks release-pins git log), so Commit≈Pin and the lead-time anchors on build.startedAt (the genuinely-earliest event). Lesson for any fast automated pipeline: show durations/gaps/lead-time, not repeated absolute timestamps.
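The "durations and gaps, not repeated absolute timestamps" idea above can be sketched in a few lines. This is an illustrative helper, not the UI's code: the function name and the step labels are assumptions, and the only grounded detail is that the lead time anchors on the earliest step (build.startedAt in the text above).

```python
from datetime import datetime
from typing import Dict, List, Tuple


def timeline_gaps(steps: List[Tuple[str, str]]) -> Dict[str, object]:
    """Turn ordered (label, ISO-8601 timestamp) pairs into gaps and a lead time.

    The first step is treated as the anchor, so the lead time is
    last timestamp minus first timestamp, in whole seconds.
    """
    if len(steps) < 2:
        raise ValueError("need at least two steps to compute a gap")
    times = [(label, datetime.fromisoformat(ts)) for label, ts in steps]
    gaps = [
        (f"{a_label} -> {b_label}", int((b_time - a_time).total_seconds()))
        for (a_label, a_time), (b_label, b_time) in zip(times, times[1:])
    ]
    lead_time = int((times[-1][1] - times[0][1]).total_seconds())
    return {"gaps": gaps, "lead_time_seconds": lead_time}
```

Because the automated outer-loop finishes in one short burst, the per-step gaps differ even when every row's "N mins ago" would round to the same value.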
App-wide deployment notifications (toast + sidebar bell — 2026-06-05). Beyond the pipeline page, workflow-builder now notifies on EVERY authenticated page when an image actually REPLACES a deployment (a component's live image tag changes on a cluster — the "your change is live" signal): a svelte-sonner toast + a sidebar notification bell (unread badge + localStorage history). It's admin-gated (the inventory/SSE endpoints require platform-admin) and INVENTORY-DIFF-driven (same philosophy as the build feedback above): a singleton store (src/lib/stores/deployment-notifications.svelte.ts, started once from the root layout onMount) baselines each env:component's SET of live image tags from /api/v1/gitops/deployment-metadata and fires when a genuinely NEW tag appears while Synced. The gitops SSE stream is only a debounced re-check trigger; a 25s poll is the fallback. Detection gotcha (verify if you touch it): the inventory's live.images mid-rollout holds BOTH the old AND new component tags (old+new ReplicaSet pods coexist) and desired.image is a full ref WITH a tag, not a bare repo — so use a tag-SET diff (current − baseline), not a single "current tag", or the old tag wins and nothing fires. See the [[project_app_wide_deploy_notifications]] memory. To test a notification without a full build, commit-pin a service on ryzen to an existing GHCR tag (e.g. SKAFFOLD_IMAGE=ghcr.io/pittampalliorg/<svc>:git-<sha> bash skaffold/hooks/commit-pin.sh <svc>) — the live-image change fires it.
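The tag-SET diff described above can be sketched as follows. This is a minimal model of the detection rule, not the Svelte store itself: `DeployWatcher`, `observe`, and `tag_of` are hypothetical names, and the baseline handling after a notification is an assumption.

```python
from typing import Dict, Iterable, Set


def tag_of(image_ref: str) -> str:
    """Return the tag of a full image ref, or "" when the ref has no tag."""
    without_digest = image_ref.split("@", 1)[0]
    last_segment = without_digest.rsplit("/", 1)[-1]
    return last_segment.split(":", 1)[1] if ":" in last_segment else ""


class DeployWatcher:
    """Tracks the set of live tags per env:component and reports new ones."""

    def __init__(self) -> None:
        self.baseline: Dict[str, Set[str]] = {}

    def observe(self, key: str, live_images: Iterable[str], synced: bool) -> Set[str]:
        current = {tag_of(ref) for ref in live_images} - {""}
        if key not in self.baseline:
            self.baseline[key] = current  # first sighting only sets the baseline
            return set()
        fresh = current - self.baseline[key]
        if not fresh:
            self.baseline[key] = current  # old tags drain away after the rollout
            return set()
        if not synced:
            return set()  # keep the baseline so the tag still fires once Synced
        self.baseline[key] = current
        return fresh
```

Mid-rollout the live list holds both the old and the new tag, so a check based on a single "current tag" can keep seeing the old one; the set difference reports the new tag regardless of ordering.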

The "which file?" matrix (single most-referenced piece of knowledge)

| Cluster | Image source | Bump path | Branch the bump lands on |
| --- | --- | --- | --- |
| ryzen | `packages/components/workloads/<image>/manifests/kustomization.yaml` (`images:` block) or release-pinned GHCR refs for shared workload images | Edit stacks, commit/merge to `main` (or run `commit-pin.sh`, which now pushes to `main`). Ryzen's LOCAL ArgoCD (`root-ryzen` @ `main`) re-renders `packages/overlays/ryzen` directly; hard-refresh the affected `ryzen-*` apps (or `deployment/scripts/ryzen-sync.sh`) to skip the poll interval | GitHub `PittampalliOrg/stacks` `main` (no `inner-loop`, no `env/spokes-ryzen`; both retired) |
| dev / staging | `packages/components/hub-spoke-appsets/release-pins/workflow-builder-images.yaml` (`images` compatibility tags plus `digests`, `imageRefs`, `sourceShas`, `pipelineRuns`, `updatedAts`) rendered into dry-source overlays at `packages/components/workloads/workflow-builder-system-overlays/{dev,staging}/kustomization.yaml` | Hub Tekton outer-loop `update-stacks` writes release metadata and regenerates overlays; observed current path can push directly to `origin/main`, while PR-mode opens `release/workflow-builder-*`. Manual changes must update/validate the same metadata and overlays | `origin/main` release metadata commit, or `release/workflow-builder-*` PR branch → `origin/main` when PR mode is active |
| hub itself | source-hydrator from `packages/overlays/hub` on `origin/main` → `env/hub-next` → `env/hub` (gated by `stacks-environments` PromotionStrategy) | Edit overlay; merge to `origin/main` | `origin/main` (GitHub) |

The static dapr-agent-py pool Deployment (backing the openshell-durable-agent enum + the agent-runtime-pool-coding benchmark pool) is a third path: bumped via its workloads image pin (no per-cluster override for the manifest), so a single bump applies to all spokes once it's on origin/main. (The old custom agent-runtime-controller Kopf operator that materialized per-agent Deployments is retired — there is no Deployment-agent-runtime-controller.yaml to bump.)

The ryzen row's bare-images: mechanism above holds for most Skaffold-owned services (workflow-orchestrator, function-router, mcp-gateway — commit-pin edits their newTag directly), but workflow-builder + workflow-mcp-server are an exception since C1 (2026-06-04): their bare images: block was deleted; ryzen's pin now lives in the render-generated Component packages/components/workloads/workflow-builder-ryzen-image/kustomization.yaml, which their workloads kustomization components:-includes. commit-pin for those two writes the FLAT file packages/components/hub-spoke-appsets/release-pins/workflow-builder-images-ryzen.yaml AND renders + commits the Component LOCALLY in the same push (wfb PR #37); stacks CI (.github/workflows/render-ryzen-image.yml) is just a drift-correction safety net that re-renders only on a diff. See the "Workflow-builder image pin has two visible truths" gotcha for the full flow.

release-pins/workflow-builder-images.yaml is the image-pin source for promoted dev/staging workflow-builder-system child Applications, but it is not applied directly by the ApplicationSet. It is rendered into the dry-source overlays with scripts/gitops/render-workflow-builder-release-overlays.sh, and source-hydrator reads those overlays. Manual release-pin edits must run the renderer or scripts/gitops/validate-workflow-builder-release-pins.sh will fail the overlay freshness check.
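Before running the renderer on a manual edit, it can help to confirm that every service appears in all of the metadata maps. The sketch below assumes each map (`images`, `digests`, `imageRefs`, `sourceShas`, `pipelineRuns`, `updatedAts`) is keyed by service name and that the file has already been loaded into a dict; both are assumptions about the file's shape, and this check does not replace scripts/gitops/validate-workflow-builder-release-pins.sh.

```python
from typing import Dict, List

METADATA_MAPS = (
    "images",
    "digests",
    "imageRefs",
    "sourceShas",
    "pipelineRuns",
    "updatedAts",
)


def missing_metadata(pins: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return {service: [maps that lack an entry for it]} for incomplete services."""
    services = set()
    for map_name in METADATA_MAPS:
        services.update(pins.get(map_name) or {})
    gaps: Dict[str, List[str]] = {}
    for service in sorted(services):
        absent = [m for m in METADATA_MAPS if service not in (pins.get(m) or {})]
        if absent:
            gaps[service] = absent
    return gaps
```

An empty result means every service named in any map is present in all six; a non-empty one lists exactly which maps a half-finished manual bump forgot.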

Do not add release-pin lookups back into spoke-workloads-appset.yaml. Argo CD source-hydrator caches by dry-source commit; when rendered output depends on ApplicationSet generator values outside the dry source, a controller race can hydrate the right dry SHA with stale inline values and then keep reusing that hydrated commit. Keep release-pin-derived images/env in the generated dry-source overlays instead.
Runtime images (browser-use-agent-sandbox, dapr-agent-py-sandbox, claude-agent-py-sandbox) are read from AGENT_RUNTIME_*_DEFAULT_IMAGE env vars on the workflow-builder Deployment (packages/components/workloads/workflow-builder/manifests/Deployment-workflow-builder.yaml) or launch-specific runtime config; the runtime registry SSOT (services/shared/runtime-registry.json) maps each runtime to its imageEnvKey. kustomize.images substitutes container image: fields but not env var values, so release-pins bumps don't touch these. Bump the env var and verify the live BFF pod sees it — the BFF reads it per session when it spawns the agent-sandbox Sandbox pod, so the next session uses the new image (no per-agent AgentRuntime CR to patch; the CRD + controller are retired). For the static dapr-agent-py pool, roll its Deployment. See runbooks/bump-image-pin-not-in-release-pins.md.
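Because kustomize `images:` does not rewrite env var values, a quick check of the Deployment's env is worth scripting. The sketch below assumes the input is the parsed JSON of the workflow-builder Deployment (for example from `kubectl get deploy workflow-builder -n workflow-builder -o json`); the function names are hypothetical, and it only inspects the Deployment spec, so the live BFF pod env still needs its own check.

```python
from typing import Dict, Optional, Tuple


def runtime_image_env(deployment: dict) -> Dict[str, str]:
    """Collect AGENT_RUNTIME_*_DEFAULT_IMAGE literals from all containers."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    found: Dict[str, str] = {}
    for container in pod_spec.get("containers", []):
        for env in container.get("env", []):
            name = env.get("name", "")
            if (
                name.startswith("AGENT_RUNTIME_")
                and name.endswith("_DEFAULT_IMAGE")
                and "value" in env  # skip valueFrom references
            ):
                found[name] = env["value"]
    return found


def stale_runtime_images(
    deployment: dict, expected: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {env name: (live value, expected value)} for every mismatch."""
    live = runtime_image_env(deployment)
    return {
        name: (live.get(name), want)
        for name, want in expected.items()
        if live.get(name) != want
    }
```

An empty dict from `stale_runtime_images` means the Deployment already carries the expected runtime images; any entry shows the value that a release-pins bump left untouched.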

Ryzen is an AUTONOMOUS argocd-agent spoke with its OWN local ArgoCD; no local Gitea, no idpbuilder (GitHub + GHCR only). Ryzen-affecting manifest changes flow through GitHub main:

Ryzen-only image-tag bumps (Skaffold outer-loop): commit-pin.sh pushes to GitHub main. Ryzen's LOCAL ArgoCD (root-ryzen @ main) re-renders and applies the bump — NO source-hydrator, NO Promoter, NO env/spokes-ryzen for this path (all retired for ryzen).
Manifest changes affecting hub itself (cluster Secrets, ApplicationSet definitions, headlamp Service annotations): commit to GitHub main. Hub Source Hydrator hydrates packages/overlays/hub to env/hub-next. GitOps Promoter then creates env/hub-next → env/hub PRs that MUST be merged for the change to take effect.
Manifest changes affecting dev/staging workload-layer: commit to main. Source-hydrator renders packages/overlays/<spoke> to env/spokes-<spoke>-next; Promoter gates the env/spokes-dev-next → env/spokes-dev step; the principal then pushes the Applications to the managed dev/staging agents. (Ryzen is NOT on this path — it reconciles overlays/ryzen @ main itself.)

Hub→ryzen kube-api reach (the ryzen Tailscale operator's apiserver-proxy SNI path, or the ryzen host raw-TCP tailscale serve --tcp=6443 passthrough) is RETIRED as the ArgoCD sync path — under argocd-agent ryzen reconciles its own apps locally and the agent dials the hub principal OUTBOUND (8443). The hub→ryzen kube endpoint now exists ONLY for Headlamp (the host-passthrough endpoint in the dedicated headlamp.dev/cluster=true Secret). The cluster-ryzen Secret is an AGENT MAPPING (server=https://argocd-agent-resource-proxy:9090?agentName=ryzen + embedded mTLS, NO bearerToken), not a kube-API endpoint. See reference/architecture.md ("Control plane: argocd-agent v0.8.1" and "Spoke registration"), cluster-desired-state for the recreate path, and the ryzen-spoke-bootstrap skill's references/failure-modes.md.
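The difference between an agent mapping and a kube-API cluster Secret can be checked mechanically. The sketch below takes the decoded `server` and `config` values of a cluster Secret and assumes the mapping follows the standard Argo CD cluster Secret layout (`config` is JSON with an optional `bearerToken` and a `tlsClientConfig` holding `certData`/`keyData`); the function name is hypothetical.

```python
import json
from typing import Optional
from urllib.parse import parse_qs, urlparse


def agent_mapping_name(server: str, config_json: str) -> Optional[str]:
    """Return the agentName when the Secret looks like an agent mapping, else None."""
    names = parse_qs(urlparse(server).query).get("agentName", [])
    config = json.loads(config_json) if config_json else {}
    tls = config.get("tlsClientConfig") or {}
    has_mtls = bool(tls.get("certData")) and bool(tls.get("keyData"))
    if names and has_mtls and "bearerToken" not in config:
        return names[0]
    return None
```

A Secret that still carries a bearerToken, or whose server URL has no agentName query parameter, returns None and should be treated as a direct kube-API endpoint rather than a migrated spoke.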

For hot-loop regression checks, use deployment/scripts/benchmark-ryzen-hot-edit.sh with BENCHMARK_PURPOSE=normal|manual|threshold-test and BENCHMARK_CASE=child-service|app-definition|dependency-file. The app-definition case uses a source-only child Application marker so it exercises root/app-definition planning without leaving live Application fields behind. The summary command defaults to --purpose normal and excludes failed threshold-test reports; use --purpose all --include-failures when auditing full history.

MCP/auth has a third, non-image flow. mcp_connection and app_connection rows live in the workflow-builder DB; activepieces-mcps reconciles those rows into Knative services; at session launch the BFF resolves the agent's MCP servers and passes them into the per-session runtime pod as DAPR_AGENT_PY_BOOTSTRAP_MCP_SERVERS_JSON (read per-session — no per-agent CR caches it). A source push alone does not fix an agent whose stored MCP config is stale; fix the mcp_connection/agentConfig.mcpServers rows, then launch a fresh session and verify the runtime pod env and logs show the expected servers.
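Verifying that a fresh session saw the expected servers comes down to comparing the env value with what the DB rows should produce. The sketch below is hedged heavily: the JSON shape of DAPR_AGENT_PY_BOOTSTRAP_MCP_SERVERS_JSON is not described above, so it assumes either a list of objects with a `name` field or an object keyed by server name, and the server names in any usage are made up.

```python
import json
from typing import Iterable, Set


def bootstrap_server_names(raw: str) -> Set[str]:
    """Extract server names from the env value (assumed shapes, see lead-in)."""
    data = json.loads(raw) if raw else []
    if isinstance(data, dict):
        return set(data)
    return {
        item["name"]
        for item in data
        if isinstance(item, dict) and "name" in item
    }


def missing_servers(raw: str, expected: Iterable[str]) -> Set[str]:
    """Names that should be present in the runtime pod env but are not."""
    return set(expected) - bootstrap_server_names(raw)
```

A non-empty result on a freshly launched session points at stale mcp_connection/agentConfig.mcpServers rows rather than at the image.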

Decision tree

"I need to roll out / promote / bump an image"

Which cluster? ryzen only → update the workloads manifest or GHCR image pin in stacks, then commit/merge to main (or run commit-pin.sh, which pushes to main) — ryzen's local ArgoCD picks it up. If the image does not exist yet, push the app repo to origin/main so the normal GitHub/GHCR outer-loop builds it first.
dev or staging → normal path is hub Tekton outer-loop builds GHCR and update-stacks writes tag, digest, provenance, and generated dry-source overlays. Read the task logs: if it pushed directly to origin/main, track source-hydrator + Promoter from that commit; if it opened a release/workflow-builder-* PR, review/merge it first. Manual path: edit all release metadata maps in release-pins/workflow-builder-images.yaml, run scripts/gitops/render-workflow-builder-release-overlays.sh, verify the GHCR tag/digest, run scripts/gitops/validate-workflow-builder-release-pins.sh, then follow runbooks/promote-image-to-spokes.md.
Want dev/staging to use an image you validated on ryzen → use the GHCR tag/digest as the promoted artifact, then bump release-pins and generated overlays. Legacy Gitea registry mirroring is recovery-only; it is not the normal source of promoted images.

"I pushed workflow-builder and need to verify ryzen + dev"

Confirm the app repo commit is on origin/main so the hub Tekton outer-loop builds the new ghcr.io/pittampalliorg/workflow-builder:git-<sha> image.
Ryzen: for fast iteration, commit-pin.sh already pushed the new tag to GitHub main — ryzen's LOCAL ArgoCD picks it up automatically. Verify with kubectl --context admin@ryzen get application ryzen-workflow-builder -n argocd (should be Synced/Healthy) and confirm the live Deployment image on ryzen via kubectl --context admin@ryzen get deploy workflow-builder -n workflow-builder -o jsonpath='{.spec.template.spec.containers[0].image}'. During active skaffold dev sessions the Skaffold-owned dev pod may serve live traffic from synced source while the local ArgoCD app is paused (skip-reconcile); inspect the live pod before assuming the image rollout is what users hit.
Dev: watch hub outer-loop-workflow-builder-*; capture the built GHCR tag/digest and read update-stacks logs. The task may push release metadata directly to origin/main or open a release PR depending on the active pipeline.
Track spoke-dev-workflow-builder.status.sourceHydrator.currentOperation.{drySHA,hydratedSHA}, the workflow-builder-release-env-spokes-dev-* ChangeTransferPolicy, and dev-workflow-builder / spoke-dev-workflow-builder health. If env/spokes-dev-next advanced but the CTP still proposes the older dry SHA after one source-hydrator poll, annotate PromotionStrategy/workflow-builder-release and the dev CTP with fresh promoter.argoproj.io/refresh-ts.
Finish with authenticated smoke tests against the public URLs. For schema-affecting workflow-builder changes, verify the db-migrate hook applied the expected migration before trusting the UI. For Prompt Workbench/preset changes, confirm resource_prompt_versions exists and run an authenticated /api/prompt-presets list/create/update/archive smoke. On NixOS, if Playwright's bundled browser cannot launch, use system Chrome at /etc/profiles/per-user/vpittamp/bin/google-chrome.
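The hydrator tracking step above can be scripted against the Application's JSON. The sketch below reads the `status.sourceHydrator.currentOperation.{drySHA,hydratedSHA}` path named above from a parsed `kubectl get application ... -o json` document; the helper names and the prefix-matching rule for short SHAs are assumptions.

```python
from typing import Dict, Optional


def hydrator_shas(application: dict) -> Dict[str, Optional[str]]:
    """Return the dry and hydrated SHAs of the current hydrate operation."""
    status = application.get("status") or {}
    operation = (status.get("sourceHydrator") or {}).get("currentOperation") or {}
    return {
        "drySHA": operation.get("drySHA"),
        "hydratedSHA": operation.get("hydratedSHA"),
    }


def dry_sha_caught_up(application: dict, expected_dry_sha: str) -> bool:
    """True when the hydrator reports the expected dry commit (short SHAs allowed)."""
    dry = hydrator_shas(application)["drySHA"]
    if not dry or not expected_dry_sha:
        return False
    return dry.startswith(expected_dry_sha) or expected_dry_sha.startswith(dry)
```

If `dry_sha_caught_up` is true for the release metadata commit but the ChangeTransferPolicy still proposes the older dry SHA, that is the case where the refresh annotation described above applies.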

"I edited stacks manifests and need ryzen to pick them up"

Decide the right path:
- Ryzen-only image-tag bump (typical after skaffold run): commit/merge to main via commit-pin.sh. Ryzen's LOCAL ArgoCD (root-ryzen @ main) reconciles it directly — no inner-loop, no env/spokes-ryzen, no Promoter.
- Hub-affecting change (cluster Secret, ApplicationSet, Tailscale Service annotation): commit to main, then merge the env/hub-next → env/hub Promoter PR (gh pr list -R PittampalliOrg/stacks --state open --search 'Promote').
- Spoke-workloads change: commit to main. Ryzen reconciles overlays/ryzen @ main directly; dev/staging pick it up via source-hydrator + their Promoter step.

Trigger immediate refresh on ryzen instead of waiting for the poll interval:

kubectl --context admin@ryzen -n argocd annotate application ryzen-<svc> argocd.argoproj.io/refresh=hard --overwrite
# or: deployment/scripts/ryzen-sync.sh   (hard-refreshes root-ryzen)

If a sync is stuck with "another operation is already in progress", run argocd app terminate-op <app>, then retry the sync with --replace.
For child Application spec changes that don't propagate, the env/hub Promoter ladder might be stuck: see runbooks/manage-gitops-promoter.md. The fast path, if Promoter hasn't auto-created the PR, is gh pr create --base env/hub --head env/hub-next followed by a merge.

"I updated SWE-bench evaluator/coordinator and need to deploy/test"

Confirm the workflow-builder commit includes the intended services/swebench-evaluator/Dockerfile pin or coordinator changes, and is pushed to origin/main so the GHCR outer-loop can build it. For ryzen, update the stacks workloads pin on main and let ryzen's local ArgoCD pick it up.
Watch hub Tekton build swebench-evaluator:git-<workflow-builder-sha> and capture the GHCR digest from release-pins/workflow-builder-images.yaml / generated workflow-builder-system overlay.
Track dev-swebench-coordinator to Synced/Healthy, then verify the live Deployment has SWEBENCH_EVALUATOR_IMAGE=ghcr.io/pittampalliorg/swebench-evaluator:git-<sha> with the expected digest-backed release.
Run a focused SWE-bench canary: one known gold patch that resolves, one empty-patch case that returns empty_patch, and, when available, an environment/build validation case. Evaluator Jobs should use the expected resource class and ttlSecondsAfterFinished=3600.
For UI-visible validation, create or trigger a Benchmarks page run (/workspaces/<slug>/benchmarks) and confirm artifact SHA-256s, provenance, official result, raw harness notes, report path, and job name appear in the run API/UI. The evaluations skill has the DB/coordinator smoke path for deterministic runs.

For DeepSeek SWE-bench validation, use the direct DeepSeek model specs (deepseek/deepseek-v4-pro, deepseek/deepseek-v4-flash) and confirm the selected agent runtime reports provider deepseek with llm-deepseek-v4-* components. Effective concurrency is the minimum of UI/requested concurrency, runtime slots, per-sidecar Dapr workflow capacity, global benchmark caps, sandbox headroom, and model caps; when in doubt read the evaluations skill references/swebench-concurrency.md before changing stacks values.
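The "minimum of all caps" rule above is simple arithmetic, sketched here so it is easy to sanity-check before touching stacks values. The function name and the sample numbers are illustrative assumptions. The real caps come from the UI request, the runtime configuration, and the evaluations skill references.

```shell
# Effective concurrency is the smallest of all the caps that apply.
# Pass each cap as a positive integer argument; prints the minimum.
effective_concurrency() {
  printf '%s\n' "$@" | sort -n | head -n 1
}

# Made-up numbers, in the order the caps are listed above:
# requested=8 runtime_slots=6 dapr_capacity=4 global_cap=16 sandbox_headroom=5 model_cap=10
effective_concurrency 8 6 4 16 5 10   # prints 4
```

In this example the per-sidecar Dapr workflow capacity is the binding limit, so raising the requested concurrency alone would change nothing.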

"A workflow-builder agent is silent after adding an MCP/OAuth connection"

Use runbooks/debug-workflow-builder-mcp-auth.md. The short path:

Confirm workflow-builder, activepieces-mcps, and knative-serving are Synced/Healthy on the target cluster.
Confirm the piece appears in activepieces-mcp-catalog and its ap-<piece>-service KService is Ready. The URL should be http://ap-<piece>-service.workflow-builder.svc.cluster.local/mcp with no explicit :3100.
Confirm the mcp_connection.connection_external_id points at an active app_connection.external_id; MCP credentials flow through X-Connection-External-Id, not inline secrets in manifests.
Fix the agent's mcp_connection/agentConfig.mcpServers rows, then launch a fresh session and check the per-session runtime pod env for DAPR_AGENT_PY_BOOTSTRAP_MCP_SERVERS_JSON (resolved per-session by the BFF — there is no generated agent-runtime-<slug> Deployment anymore).
Wake/test the agent and read dapr-agent-py logs for [mcp-bootstrap] connected ... and tool registration. A first health probe may time out during Knative cold start; retry before declaring the KService broken.

"A workflow-builder runtime pod is 1/2 or daprd readiness is false"

Use runbooks/debug-dapr-sidecar-stale-readiness.md. First identify whether the app container or daprd is not ready. If the app container is ready but daprd reports ERR_HEALTH_NOT_READY for grpc-api-server / grpc-internal-server, check recent dapr-system placement/scheduler churn, verify the control plane is healthy now, then recycle only the affected workflow-builder Deployment. Do not clear state stores or restart Dapr control-plane components unless they are still unhealthy.
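The first step, telling the app container apart from daprd, can be sketched with a kubectl JSONPath query piped into a small filter. The helper name not_ready_containers, the workflow-builder namespace, and the pod placeholder are assumptions for illustration. The runbook remains the authoritative procedure.

```shell
# Read "name=ready" lines on stdin and print the containers that are not ready.
not_ready_containers() {
  awk -F= '$2 != "true" { print $1 }'
}

# Example (namespace is assumed, pod name is a placeholder):
#   kubectl -n workflow-builder get pod <pod> \
#     -o jsonpath='{range .status.containerStatuses[*]}{.name}={.ready}{"\n"}{end}' \
#     | not_ready_containers
```

If the only name printed is daprd, continue with the sidecar path described above. If the app container is listed, start with its own logs instead.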

"An ArgoCD app is OutOfSync / stuck"

Query hub ArgoCD even for dev/staging: kubectl --kubeconfig ~/.kube/hub-config get applications.argoproj.io -n argocd.
Check kubectl get app <name> -n argocd -o jsonpath='{.status.operationState.phase}'.
Phase=Running for hours? Check .status.operationState.message — usually waiting for completion of hook batch/Job/db-migrate. Drill: kubectl get jobs -n workflow-builder on the spoke; if db-migrate is stuck Terminating, see runbooks/recover-stuck-job-finalizer.md.
Controller log shows "Skipping auto-sync: failed previous sync attempt"? ArgoCD won't retry the same revision — see runbooks/recover-stuck-promotion.md (terminate-op + force sync via argocd CLI on Tailscale).
Job Pod is Init:ImagePullBackOff with "not found"? The image isn't on ghcr.io yet — the outer-loop build didn't produce that tag; rebuild from the source commit (see runbooks/debug-funnel-orphan-tag.md).
Need a fleet review or decide whether legacy apps should be removed? Use runbooks/review-argocd-app-health.md before applying fixes. It covers keep/remove decisions, stale status cache, ExternalSecret/Tekton default drift, Tailscale egress mutation, and hub promotion.

"Review all degraded/out-of-sync apps and remove legacy resources"

Use runbooks/review-argocd-app-health.md. The short rule: identify whether each resource is still part of the current system before fixing drift. Known outcomes:

Keep agent-sandbox-crds / <spoke>-agent-sandbox-crds; OpenShell and agent-runtime controllers require those CRDs.
Remove AutoKube references; AutoKube is legacy in this repo.
The old hcloud-spoke Crossplane AzureWorkloadIdentity claim/provider path is legacy; hcloud spoke lifecycle now uses hcloud/talos/kubernetes/terraform providers and existing Azure Workload Identity configuration.
For needed apps, prefer making desired manifests match API-controller defaults over broad ignoreDifferences; use ignores only for intentional operator mutation like Tailscale egress Services.

"GitHub webhook didn't fire / image build doesn't reach ghcr.io"

Triage by gh api .../hooks/<id>/deliveries status_code first — there are TWO common failure modes on the same path:

status_code: 0 + dig @1.1.1.1 tekton-hub.tail286401.ts.net NXDOMAIN → Tailscale Funnel orphan-tag on ts-tekton-github-triggers proxy. The policy.hujson lost a tag the device still uses; control plane drops the funnel cap. See runbooks/debug-funnel-orphan-tag.md (Funnel orphan tag section).
status_code: 202 (accepted) but no PipelineRun on hub → EL processing failure. el-github-outer-loop logs show Post "": unsupported protocol scheme "" at sink/sink.go:413 for the matching /triggers-eventid. Same runbook, "EL processing failure" section. Workaround: skopeo-mirror to ghcr.io + bump release-pins manually until the EL is fixed.
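The two failure modes above reduce to a small decision on the delivery status code and whether a PipelineRun appeared. The sketch below encodes only that decision. The function name webhook_triage and its output labels are hypothetical, and you still need to collect the status code and the PipelineRun count yourself.

```shell
# Hypothetical triage helper for a single webhook delivery.
# Arguments: <delivery status_code> <number of matching PipelineRuns on hub>
webhook_triage() {
  status="$1"
  pipelineruns="$2"
  if [ "$status" -eq 0 ]; then
    echo "funnel-orphan-tag: check DNS for the Funnel hostname and policy.hujson tags"
  elif [ "$status" -eq 202 ] && [ "$pipelineruns" -eq 0 ]; then
    echo "el-processing-failure: read el-github-outer-loop logs for the event id"
  elif [ "$status" -eq 202 ]; then
    echo "ok: delivery accepted and a PipelineRun exists"
  else
    echo "other: status $status is outside the two common failure modes"
  fi
}
```

Both failure labels point at the same runbook, runbooks/debug-funnel-orphan-tag.md, but at different sections, so the label tells you where to start reading.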

"I edited a workflow JSON spec — when does it deploy?"

Workflow JSONs at services/<agent>/<name>.workflow.json in the workflow-builder repo are not baked into the workflow-builder image (the production Dockerfile copies src/ and drizzle/ only — services/ is excluded). Spec changes (new prompt, agentKwargs, maxTurns, etc.) require a manual DB upsert against the spoke's postgres. Either run node scripts/<workflow>.mjs --user-email ... from a pod with DATABASE_URL set, or directly UPDATE workflows SET spec = $jsonFromFile.spec, nodes = ..., edges = ... WHERE id = '<workflow-id>'. Image rebuilds alone won't roll the change. See runbooks/upsert-workflow-json.md.

"I shipped a migration but the new columns aren't on dev/staging"

Almost always: the SQL file in drizzle/ is missing from drizzle/meta/_journal.json. npx drizzle-kit migrate (the db-migrate Sync hook) silently skips files without journal entries — Job exits 0 but nothing gets applied. See runbooks/fix-drizzle-migration.md. (BFF will then 500 on every query that includes the new column.)
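A quick pre-merge check for this failure is to list SQL files that have no journal entry. The sketch below assumes the usual drizzle layout, where each journal entry carries a "tag" equal to the SQL filename without its .sql extension. The function name is hypothetical, and the grep match is a rough text check rather than a JSON parse.

```shell
# Print every migration in <dir> whose name has no matching "tag" in
# <dir>/meta/_journal.json. Prints nothing when the journal is complete.
# Argument: path to the drizzle directory (defaults to ./drizzle).
missing_journal_entries() {
  dir="${1:-drizzle}"
  for f in "$dir"/*.sql; do
    [ -e "$f" ] || continue
    tag="$(basename "$f" .sql)"
    grep -q "\"tag\"[[:space:]]*:[[:space:]]*\"$tag\"" "$dir/meta/_journal.json" \
      || echo "$tag"
  done
}

# Example, run from the workflow-builder repo root:
#   missing_journal_entries drizzle
```

Any name it prints is a migration the db-migrate hook would skip while still exiting 0.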

"I want to track a promotion in flight"

Start with workflow-builder's admin deployment inventory when available; it shows desired image, live images, drift, build, and promotion metadata in one place. The hub ArgoCD UI now has a GitOps Promoter extension for visualizing Promoter CRDs, and PromotionStrategy + ChangeTransferPolicy + spoke ArgoCD apps remain the authoritative lower layers. See runbooks/track-promotion-state.md for both views and a CLI cheat-sheet. Most "stuck" reports are actually normal ~3 min source-hydrator poll cycles.

"workflow-builder works on ryzen but is broken on dev/staging"

Treat this as environment drift, not a live-patch task. Check the promoted spoke runtime env, tailnet exposures (the workflow-builder L4 LoadBalancer Service + tls-terminator sidecar, and any device-backed Ingresses like phoenix-*), ACL policy, spoke API VIP grants, and stale hub hydration. Typical declarative fixes are in spoke-workloads-appset.yaml, packages/base/manifests/tailscale-ingresses/, and policy.hujson. See runbooks/reconcile-workflow-builder-spoke-environment.md.

"I need to upgrade GitOps Promoter or fix the Promoter UI"

Use runbooks/manage-gitops-promoter.md. The current deployment pattern is: keep the latest published Helm chart unless a newer chart exists, override manager.image.tag when the app release is newer than the chart appVersion, and manage the ArgoCD UI extension through argocd-gitops-promoter-ui plus bootstrap deployment/config/argocd-values.yaml. Do not hand-patch long-term state without committing it to stacks.

"Which image / commit is live on dev or staging?"

Use the workflow-builder admin Deployments view or the hub inventory endpoint first. It is backed by gitops-deployment-inventory on the hub and is the fastest way to compare release-pins, Argo live images, promotion SHAs, and outer-loop build status. See runbooks/track-promotion-state.md.

"workflow-builder Deployments shows fetch failed"

First distinguish UI auth from inventory transport. From inside the workflow-builder pod, WORKFLOW_BUILDER_GITOPS_INVENTORY_URL should be http://gitops-inventory-hub-egress.tailscale.svc.cluster.local:8080/inventory.json. The egress Service in tailscale should target tailscale.com/tailnet-fqdn: gitops-inventory-hub-node.tail286401.ts.net, port 8080. Do not target gitops-inventory-hub.tail286401.ts.net from the egress Service; that is a Tailscale service-host VIP, not a tailnet node. See runbooks/track-promotion-state.md.

"A promoted-spoke Tailscale Ingress has no address, a -1 suffix, or stale DNS"

First check whether the Ingress has tailscale.com/proxy-group. Promoted-spoke app URLs such as phoenix-* are normally device-backed Tailscale Ingresses, not svc:* service-hosts. Debug stale tailnet devices, stale Tailscale Services, and operator-managed Secret metadata with runbooks/debug-device-backed-tailscale-ingress.md. (workflow-builder-* is no longer an Ingress — it is an L4 LoadBalancer Service + tls-terminator sidecar since PR #2319; mcp-gateway is in-cluster only. See reference/access-paths.md.)

"A ProxyGroup service-host VIP has no address or TLS/cert is broken"

If the resource is a ProxyGroup-hosted service such as argocd-hub, nocodb-hub, or gitops-inventory-hub, debug service-host tags, not Funnel. Check the Ingress tailscale.com/tags, policy.hujson autoApprovers.services, the Tailscale Service tags, and the proxy pod Self.Tags / CapMap["service-host"]. See runbooks/debug-proxygroup-service-host.md.

"Hub ArgoCD can't reach a spoke / `spoke-<name>` shows ComparisonError / a new spoke won't register"

Sync transport (all spokes): under argocd-agent each spoke reconciles its own apps locally; the agent dials the hub principal OUTBOUND (8443) over tailnet mTLS. The hub→spoke kube-API reach (apiserver-proxy SNI, ryzen host-passthrough) is RETIRED for sync. A spoke-<name> ComparisonError on the hub principal is usually a stale/misconfigured AGENT MAPPING (cluster-<spoke> Secret: server=...?agentName=<spoke> + embedded mTLS, NO bearerToken) or an agent that isn't dialing in — check the agent pod on the spoke and the principal logs, not a kube-API SNI path.
Registration (enroll, not register-spoke): register-spoke-with-hub.sh is RETIRED. dev enrolls via deployment/scripts/argocd-agent/enroll-dev-agent.sh (MANAGED agent); ryzen enrolls via deployment/scripts/argocd-agent/enroll-ryzen-agent.sh (AUTONOMOUS agent — mints the agent mTLS cert, applies the packages/components/hub-management/manifests/ryzen-agent-bootstrap kustomize component including the root-ryzen app-of-apps @ main, runs argocd-agentctl agent create ryzen to write the cluster-ryzen AGENT MAPPING, stages the Headlamp Secret, hard-refreshes root-ryzen). For the MANAGED spokes (dev/staging) the spoke-clusters-appset / spoke-workloads-appset still fan cluster Secrets into Applications the principal pushes down; edit the packages/components/hub-spoke-appsets/ copy, NOT the unused hub-base copy. Ryzen is NOT driven by these appsets — it reconciles its own apps via root-ryzen.

See reference/architecture.md ("Hub → spoke ArgoCD connectivity" and "Spoke registration").

"I need kubectl on a spoke (dev / staging) and Tailscale isn't working"

See reference/access-paths.md for normal paths and runbooks/access-spoke-cluster-fallback.md for the Crossplane-kubeconfig-secret fallback.

"OAuth / social login broken — `client_id and/or client_secret passed are incorrect`"

Almost always: KeyVault *-CLIENT-ID-* and *-CLIENT-SECRET-* were rotated at different times (compare attributes.updated). See runbooks/rotate-oauth-secret.md. Watch for the ESO refresh ↔ pod restart race (see reference/secret-flow.md). If login works but the GitHub repo picker is missing org repos, it's NOT a secret problem; see the per-cluster OAuth-app org-grant gotcha below.

Critical gotchas (memorize these)

Ryzen is an AUTONOMOUS argocd-agent spoke with its OWN local ArgoCD; no local Gitea. Ryzen reconciles packages/overlays/ryzen @ main DIRECTLY via root-ryzen — GitHub main is the source for EVERYTHING ryzen (including image-tag bumps via Skaffold outer-loop + commit-pin.sh). There is no inner-loop branch (retired), no env/spokes-ryzen, no source-hydrator and no Promoter on the ryzen lane. (Hub itself still uses source-hydrator + Promoter: packages/overlays/hub → env/hub-next → env/hub, manual-merge PR.)
The idpbuilder + local-Gitea path is retired. See ryzen-spoke-bootstrap skill for the new bootstrap (talosctl + helm + kubectl). Old runbooks referencing idpbuilder stacks sync are obsolete.
Recreate automation is script-driven (named entry points). Use these instead of ad-hoc rebuilds:
- dev: deployment/scripts/talos-hetzner/recreate-dev.sh is the ORCHESTRATOR — it wraps data backup/restore (environment_image_builds/agents/workflows) + provision-spoke.sh + bootstrap-spoke-deps.sh + argocd-agent/enroll-dev-agent.sh + the verify gate. Dev rebuild entry point.
- ryzen: deployment/scripts/bootstrap-spoke-cluster.sh --recreate now ENROLLS the autonomous agent via deployment/scripts/argocd-agent/enroll-ryzen-agent.sh + the packages/components/hub-management/manifests/ryzen-agent-bootstrap kustomize component (agent-autonomous bundle + params mode=autonomous + cluster-ryzen-local alias + stacks-repo-read + cert ExternalSecrets + root-ryzen app-of-apps @ main; argocd-agentctl agent create ryzen writes the agent mapping; hard-refreshes root-ryzen). register-spoke-with-hub.sh is RETIRED and NO LONGER CALLED. The --ts-acl-mode / --ts-host-passthrough flags are VESTIGIAL (parsed for compat, ignored). ryzen reconciles overlays/ryzen @ main directly (no inner-loop, no source-hydrator).
- hub: deployment/scripts/recreate-hub.sh modes --verify-only / --seed-secret / --fixups / --dry-run-clone / --in-place --confirm-wipe hub-cluster. It NEVER hcloud-deletes the 5 ash servers; --in-place does a rolling talosctl reset reusing talos-cluster/main/secrets/hub-secrets.yaml (preserves etcd identity), re-apply, re-bootstrap ONE CP, and bootstraps onepassword-sa-token via op read (NOT JWKS); --dry-run-clone rehearses on a throwaway cluster via provision-spoke.sh. PLUS deployment/scripts/hub-verify-gate.sh (a 9-check read-only convergence gate) and a self-healing kube-system-fixups CronJob (packages/components/hub-management/manifests/kube-system-fixups/) that re-applies the Flannel --iface + CoreDNS anti-affinity patches Talos does not persist.
- Recreate-hardening hands-off fixes (PR #2395; full detail in cluster-desired-state/runbooks/recovery-and-gotchas.md):
  - root-ryzen repo-server cold-start race — on a fresh ryzen recreate the local controller's FIRST root-ryzen comparison can race the not-yet-Available argocd-repo-server (dial :8081 refused) → ComparisonError/sync=Unknown, no child apps, no re-queue for ~5 min. enroll-ryzen-agent.sh step 6b now waits rollout status deploy/argocd-repo-server then annotate application root-ryzen argocd.argoproj.io/refresh=hard; bootstrap-spoke-cluster.sh step 10 hard-refreshes again (re-compare vs the latest main HEAD). Both non-fatal.
  - Headlamp kubeconfig stale after EVERY spoke recreate (dev + ryzen) — hub Headlamp builds its kubeconfig only in its generate-kubeconfig init-container, so a pod predating the recreate keeps serving the OLD spoke endpoint/CA/token even after enroll-{dev,ryzen}-agent.sh step 5b re-stages the headlamp-cluster-<spoke> Secret. Both enroll scripts now kubectl -n headlamp rollout restart deploy/hub-headlamp deploy/hub-headlamp-embedded (guarded on deploy existence, non-fatal, off the critical path).
  - provision-spoke.sh --destroy deletes Hetzner servers in parallel — the per-server hc server delete calls are now concurrent (was sequential ~18s each); ~156s → ~20s for a 9-node dev spoke.
- Validated hands-off: ryzen bootstrap-spoke-cluster.sh --recreate = 13m9s (64/65 Synced/Healthy, zero manual intervention); dev recreate-dev.sh = 20m32s.
Hub secret root is 1Password, NOT Azure Workload Identity (migrated 2026-06). The hub's 21 ExternalSecrets resolve from the onepassword-store ClusterSecretStore (ESO onepasswordSDK provider → the dedicated hub-eso 1Password vault). Bootstrap root-of-trust = ONE scoped read-only 1Password Service-Account token (hub-eso-reader) in Secret onepassword-sa-token (ns external-secrets), persisted at op://CLI/<id>/credential and read at recreate via the operator's developer SA token (op read). The old azure-keyvault-store CSS + Azure KV (keyvault-thcmfmoo5oeow) + the AD App + the Azure OIDC/JWKS federation are DORMANT (not deleted); sync-jwks-to-azure.sh is NO LONGER in the hub recreate path (it is a SPOKE-only tool now). Spokes are UNAFFECTED — they still read hub-mirrored secrets via the ESO kubernetes-provider hub-secrets-store ClusterSecretStore over Tailscale, regardless of how the hub populates its k8s Secrets. The hub spoke-secrets ExternalSecrets that mirror into <cluster>-shared-secrets + tailnet-ca now read from onepassword-store. See reference/secret-flow.md.
Workflow-builder image pin has two visible truths (and ryzen's is now a rendered Component, NOT the bare images: block; C1, 2026-06-04). For workflow-builder + workflow-mcp-server the bare images: block was DELETED from packages/components/workloads/workflow-builder/manifests/kustomization.yaml; that kustomization now components:-includes the render-generated packages/components/workloads/workflow-builder-ryzen-image/kustomization.yaml, which carries the workflow-builder + workflow-mcp-server pin (newName/newTag) and IS ryzen's effective image. commit-pin.sh for these two services UPSERTS the flat pins file packages/components/hub-spoke-appsets/release-pins/workflow-builder-images-ryzen.yaml (images/imageRefs/digests/sourceShas) AND renders + commits the workflow-builder-ryzen-image Component LOCALLY in the same push (it runs WFB_RENDER_ENVS=ryzen scripts/gitops/render-workflow-builder-release-overlays.sh inside its fresh hard-reset ~/.cache/skaffold/stacks-ryzen clone; the render is deterministic, byte-identical to CI; wfb PR #37). It does NOT edit the manifests newTag. After pushing, it hard-refreshes (refresh=hard) the ryzen SPOKE-LOCAL app, so ryzen reconciles in SECONDS (no CI wait, no 30s poll). The stacks CI action .github/workflows/render-ryzen-image.yml is UNCHANGED but is now just a DRIFT-CORRECTION SAFETY NET: it re-renders on push and commits only on a diff, so it NO-OPS when commit-pin's local render already matches. (This is workflow-builder/workflow-mcp-server ONLY; every other Skaffold-owned service (workflow-orchestrator, function-router, mcp-gateway) STILL pins via the bare packages/components/workloads/<svc>/manifests/kustomization.yaml newTag, edited directly by commit-pin.) The dev-workflow-builder Application may still show release-pin spec.source.kustomize.images overrides to the same tag.
If ryzen reverts while dev stays correct, check the flat ryzen pins file AND the rendered workflow-builder-ryzen-image Component (commit-pin should have rendered it; the drift-net CI is a backstop), commit any fix to origin/main (ryzen reconciles main directly — no inner-loop push), and verify both ryzen's live Deployment image (via --context admin@ryzen) and hub's dev-workflow-builder.status.summary.images. RESOLVED 2026-06-05 — ryzen is now single-pin (Component only): the packages/overlays/ryzen app-of-apps had ALSO been patching the ryzen-workflow-builder Application's OWN spec.source.kustomize.images to a hardcoded sha, which ArgoCD applies ON TOP of the rendered kustomization and which therefore silently WON over the Component (commit-pin updated the Component, the app showed Synced, but the Deployment stayed frozen on the stale override sha — argocd-repo-server restart did NOT help; it is override-precedence, not a cache). That override was REMOVED (the app keeps only its non-image patches:), so the render-generated workflow-builder-ryzen-image Component is the SOLE ryzen authority and deploy:skaffold/commit-pin rolls ryzen with NO overlay edit. Telltale of the (now-guarded) trap: the app's spec.source.kustomize.images shows an OLD sha while kubectl kustomize .../workflow-builder/manifests renders the NEW one. A CI guard .github/workflows/validate-ryzen-no-app-image-overrides.yml (+ scripts/gitops/validate-ryzen-no-app-image-overrides.sh) now fails the build if any ryzen-overlay Application reintroduces it. (dev legitimately still uses spec.source.kustomize.images — that lane has a SINGLE writer, the release-pins renderer; the anti-pattern is a SECOND authority on the SAME app, which is what ryzen had.)
The ryzen Component consumes ONLY the workflow-builder + workflow-mcp-server rows of the flat ryzen pins file — and outer-loop merges never write that file. WFB_RENDER_ENVS=ryzen scripts/gitops/render-workflow-builder-release-overlays.sh derives workflow-builder-ryzen-image/kustomization.yaml from just those two services' rows in release-pins/workflow-builder-images-ryzen.yaml; any other service's row there is INERT (their ryzen pins live in the bare workloads/<svc>/manifests/kustomization.yaml). The GitHub outer-loop (merge to wfb main) builds images and writes the DEV pins file (workflow-builder-images.yaml) + dev overlay only — it does NOT touch ryzen pins. To deliver an outer-loop-built commit to ryzen, NEVER rebuild locally (same git-<sha> tag, different digest — see the skaffold-dev-loop skill): either commit-pin the existing GHCR tag (SKAFFOLD_IMAGE=ghcr.io/pittampalliorg/<svc>:git-<sha> bash skaffold/hooks/commit-pin.sh <svc> from the wfb repo) or do it manually in stacks — edit the flat ryzen pins file, run WFB_RENDER_ENVS=ryzen scripts/gitops/render-workflow-builder-release-overlays.sh, commit/push (on the NixOS HTTPS-403, push with the gh-token URL: git push "https://x-access-token:$(gh auth token)@github.com/PittampalliOrg/stacks.git" HEAD:main), then refresh=hard the ryzen-<svc> app. Note workflow-mcp-server is an actually-deployed workload since 2026-06 (was manifest-only): Deployment + Service-workflow-mcp-server.yaml (port 3200) added to the workflow-builder kustomization, DATABASE_URL/INTERNAL_API_TOKEN via envFrom workflow-builder-secrets; it hosts the goal MCP tools + workflow tools, so its pin matters now.
GitOps delivery webhook/relay topology (why ryzen needs a spoke-local refresh, verified v0.8.1, 2026-06-04). The 3 GitHub webhooks are all HUB-FACING (Tailscale Funnel): tekton-hub (build EL), argocd-webhook-hub (/api/webhook), gitops-promoter-webhook-hub — NONE reaches a spoke directly. The argocd-agent principal relays a hub-side argocd.argoproj.io/refresh annotation to MANAGED agents (dev reconciled in ~3s in a live test) but does NOT relay it to AUTONOMOUS agents: ryzen has no argocd-server (no inbound /api/webhook) and a hub-side refresh on the ryzen mirror NEVER reached the spoke live (an argocd-agent code-read suggests #447/v0.2.0 should enable autonomous refresh, but it did not reproduce — trust the live behavior). So ryzen's ONLY fast path is the SPOKE-LOCAL refresh=hard that commit-pin/ryzen-sync.sh issues; otherwise it's the 30s poll. Dev hydration is webhook-accelerated: stacks PR #2449 added argocd.argoproj.io/manifest-generate-paths (pointing at each spoke's workflow-builder-system-overlays/<spoke> dry-source path) to the spoke-workloads hydrator appset, so the hub argocd-webhook-hub fires hydration on a release-pin render into that overlay (without it, the hydrator waited ~120s for its poll); that hydrator app (spoke-dev-workflow-builder) is HUB-reconciled, so the webhook drives it directly — no agent relay on that hop. Ryzen is unaffected (sourced by root-ryzen, not this hub appset).
Outer-loop release handoff can be direct-main or PR-mode. Hub Tekton update-stacks is the source of truth: inspect its logs and Git state to see whether release metadata was pushed directly to origin/main or placed on a release/workflow-builder-* PR branch. Direct human edits to release pins should be exceptional and must pass scripts/gitops/validate-workflow-builder-release-pins.sh.
stacks-environments PromotionStrategy has autoMerge: false. Unlike workflow-builder-release (env/spokes-dev only — staging dormant, PR #2436) which auto-merges after argocd-health, the env/hub PR (gitops-promoter-*[bot]: Promote <sha> to env/hub) requires manual merge. Every change under packages/overlays/hub (which includes spoke-workloads-appset.yaml, AppSet templates, etc.) opens such a PR and the dev/staging cascade is blocked until it's merged. Easy to miss because workflow-builder-release IS auto.
Hub promoter status can lag branch tips. If env/hub-next has a newer hydrated SHA but stacks-environments-env-hub-* still proposes the prior dry SHA/PR, annotate both PromotionStrategy/stacks-environments and the ChangeTransferPolicy with a fresh promoter.argoproj.io/refresh-ts.
workflow-builder-release can lag source-hydrator too. If env/spokes-dev-next has advanced but workflow-builder-release-env-spokes-dev-* still proposes the prior dry SHA after one poll interval, refresh PromotionStrategy/workflow-builder-release plus the dev CTP with promoter.argoproj.io/refresh-ts. Do not use hard-sync as a substitute for Promoter catching up.
Concurrent outer-loop commits can leave Promoter one dry SHA behind. A single app push can trigger multiple outer-loop updates, such as workflow-builder and workflow-orchestrator release metadata. If source-hydrator's current dry SHA is newer but the workflow-builder dev CTP keeps proposing the previous dry SHA after a poll, refresh PromotionStrategy/workflow-builder-release, the dev CTP, and hard-refresh spoke-dev-workflow-builder before declaring the rollout stuck.
Release-pin validation needs each GHCR package linked to stacks via Manage Actions access. The validate-workflow-builder-release-pins GitHub Action runs skopeo inspect against every image with ${{ github.token }} (which only has packages: read). PittampalliOrg's GHCR container packages are private and built by other repos (workflow-builder, opencode-durable-agent, etc.), so the workflow's token can only read them if each package has PittampalliOrg/stacks added under Manage Actions access with Role: Read. Missing link → every image fails identically with reading manifest <tag> in ghcr.io/pittampalliorg/<image>: denied (authz, not "tag missing"). Adding a new image to release-pins/workflow-builder-images.yaml requires linking its package before merging. See runbooks/grant-stacks-ghcr-package-access.md.
Release-pin hydration must be dry-source deterministic. spoke-workloads-appset.yaml should select the spoke cluster and point source-hydrator at packages/components/workloads/workflow-builder-system-overlays/<spoke>; it should not template imageRefs, sourceShas, sandbox-image env, or other release-pin values inline. Argo CD source-hydrator is dry-SHA oriented, so if rendered output depends on values outside the dry source a race can produce env/spokes-<spoke>-next with stale child Application images even while .status.sourceHydrator.currentOperation.drySHA is current. Fix stale release-pin renders by regenerating the overlays and committing a real dry-source change, not by relying on empty commits as the normal path.
Do not delete agent sandbox CRDs as "duplicates." agent-sandbox-crds is the CRD owner for the upstream kubernetes-sigs/agent-sandbox resources (Sandbox, SandboxClaim, SandboxTemplate, SandboxWarmPool). It is separate from controllers and workload apps by design so CRDs sync early. (The custom AgentRuntime CRD it used to also own is retired.)
AutoKube is legacy. If AutoKube Applications, Ingresses, ACL service approvals, or manifests appear, remove them declaratively and let Argo prune them instead of repairing them.
Argo drift review is keep/remove first, fix second. For needed resources, run argocd app diff and prefer declaring API defaults (ExternalSecret defaults, Tekton EventListener defaults, CRD defaults) over hiding real drift. Empty argocd app diff with OutOfSync status can mean stale Argo status (hard-refresh, then restart the application controller if it persists) — BUT for ExternalSecrets it usually does NOT: it's the ESO server-defaulted fields that ArgoCD's client-side diff flags but the CLI/SSA normalize away, so check the UI Diff tab, not the CLI. That class is now muted fleet-wide by a global argocd-cm resource.customizations.ignoreDifferences.external-secrets.io_ExternalSecret (2026-05-30); ESO is v2.4.1 on the external-secrets.io/v1 API. Don't burn time on controller restarts for ESO empty-diff OutOfSync — see cluster-desired-state runbook §L and runbooks/review-argocd-app-health.md.
Runtime images flow through BFF env vars, not release-pins. The per-session agent-sandbox Sandbox pod image is selected at launch time by the BFF reading env.AGENT_RUNTIME_BROWSER_USE_DEFAULT_IMAGE (browser-use-agent), env.AGENT_RUNTIME_DEFAULT_IMAGE (dapr-agent-py default), or env.AGENT_RUNTIME_CLAUDE_DEFAULT_IMAGE (claude-agent-py) — keyed off the runtime registry's imageEnvKey. These env vars are static literals on Deployment-workflow-builder.yaml. Bumping release-pins does not update them; edit the Deployment YAML and verify the live BFF pod sees the new env — the NEXT session then pulls the new image (no per-agent AgentRuntime CR to patch; the CRD + controller are retired). The static dapr-agent-py pool rolls via its own Deployment image. See runbooks/bump-image-pin-not-in-release-pins.md.
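The runtime-to-env-var mapping above can be sketched as a lookup plus a query against the Deployment. The function names are hypothetical, and the query assumes the BFF Deployment is workflow-builder in the workflow-builder namespace. It reads the value declared on the Deployment spec, so during an active skaffold dev session on ryzen the running pod should still be inspected directly.

```shell
# Map a runtime name to the BFF env var that selects its default image.
runtime_image_env_key() {
  case "$1" in
    browser-use-agent) echo "AGENT_RUNTIME_BROWSER_USE_DEFAULT_IMAGE" ;;
    dapr-agent-py)     echo "AGENT_RUNTIME_DEFAULT_IMAGE" ;;
    claude-agent-py)   echo "AGENT_RUNTIME_CLAUDE_DEFAULT_IMAGE" ;;
    *) echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}

# Print the value the Deployment currently declares for that env var.
# Assumption: deploy/workflow-builder in namespace workflow-builder.
live_runtime_image() {
  key="$(runtime_image_env_key "$1")" || return 1
  kubectl -n workflow-builder get deploy workflow-builder \
    -o jsonpath="{.spec.template.spec.containers[*].env[?(@.name==\"$key\")].value}"
}

# Example:
#   live_runtime_image claude-agent-py
```

Comparing this output before and after the Deployment edit confirms the bump landed. Only sessions launched afterwards pull the new image.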
Per-session sandbox image on DEV flows pins → render → the dev overlay's SANDBOX_EXECUTION_CLASSES_JSON patch — the base manifest tag is INERT for dev. The agent-host image for per-session Sandbox pods is agentHostImage inside sandbox-execution-api's SANDBOX_EXECUTION_CLASSES_JSON (a JSON-string env var — kustomize.images cannot rewrite inside it). On dev, bump the release pins and run scripts/gitops/render-workflow-builder-release-overlays.sh: it regenerates workflow-builder-system-overlays/dev/kustomization.yaml, which patches that env var (every execution class's agentHostImage) onto the sandbox-execution-api Deployment. The hardcoded tag in the base workloads/sandbox-execution-api/manifests/Deployment-sandbox-execution-api.yaml is stale-by-design and inert wherever the overlay applies — don't "fix" it expecting a dev rollout. Rollout nuance (verified live): sessions spawned during the rollout window keep their old pod, and a mid-session pod reschedule preserves conversation history (durable state) — no need to kill in-flight sessions to land the image.
SWEBENCH_EVALUATOR_IMAGE is an env var, not a container field. The swebench-coordinator base has a kustomize replacement that copies the rewritten image from local-config Pod/swebench-evaluator-image into the Deployment env var. If a coordinator launches stale evaluator Jobs, verify both the generated overlay and live Deployment env. On ryzen, an active skaffold dev session (with ArgoCD skip-reconcile) can mask the declarative env the same way it masks workflow-builder BFF env vars; dev is the cleaner rollout target for promoted SWE-bench evaluator validation.
Sandbox templates (workspace_profile.with.sandboxTemplate) resolve via SANDBOX_TEMPLATE_IMAGES_JSON on the workflow-builder Deployment (NOT kustomize.images). For dev/staging, the release-pins renderer stamps this env var into the generated dry-source overlays. The env var is a JSON object mapping template names to image refs. Adding a new template name = (1) add a Dockerfile.<name> under services/openshell-sandbox/environments/ in the workflow-builder repo, (2) commit with subject environment(<name>): so the env-image-build pipeline fires, (3) add the image pin/rendering path in stacks.
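A minimal sketch of the template lookup, assuming only what is stated above: the env var holds a JSON object mapping template names to image refs. The function name and the template/image values in the usage note are made up for illustration.

```python
import json
from typing import Mapping


def resolve_sandbox_template(template: str, env: Mapping[str, str]) -> str:
    """Look up a sandboxTemplate name in SANDBOX_TEMPLATE_IMAGES_JSON."""
    images = json.loads(env.get("SANDBOX_TEMPLATE_IMAGES_JSON", "{}"))
    if template not in images:
        # A missing key means the pin/rendering path in stacks was never added.
        raise KeyError(f"no image pinned for sandbox template {template!r}")
    return images[template]
```

A template that has a Dockerfile and a built image but no entry in the rendered env var fails this lookup, which is why step (3) above is not optional.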
Legacy Gitea dev-image commits are not the ryzen hot path. If an old artifact mentions chore(dev-images): deploy ... to ryzen, treat it as historical build-lane evidence. Update the stacks pin and use the GitHub branch flow (commit-pin or Promoter PR).
Orchestrator wfstate_state orphan reminders can block new StartInstance. workflowstatestore is state.postgresql v2 with tablePrefix=wfstate_. When a workflow row is purged but its actor reminder is still in dapr-scheduler-server's ETCD, daprd retries the reminder ~every 10s and logs Unable to get data on the instance: <id>, no such instance exists. The retry loop can serialize behind the workflow runtime worker queue and make new ctx.call_child_workflow / StartInstance calls hit DEADLINE_EXCEEDED after 60s. Confirm via daprd logs first. For terminal cleanup, prefer the vetted Lifecycle Controller (BFF src/lib/server/lifecycle/, stopDurableRun with mode:"purge"/"reset") or the BFF benchmark cleanup endpoint — both do scoped terminate/poll/purge per per-session app-id, and the orchestrator's purge_workflow now forwards force (purge-force, Dapr 1.17.9, which cleans the associated reminders). The lifecycle-terminal-reaper CronJob (workflow-builder ns) reconciles stuck rows on a timer (skipped while a benchmark run/lease is active). Manual wfstate_state truncation is incident recovery only, after active runs and leases are zero.
environment(<slug>): commit subject is the only trigger for the env-image-build pipeline. The hub-tekton EventListener (build-environment-image trigger in EventListener-workflow-builder-fn-builds.yaml) filters on body.commits[*].message ~= '^environment\(.+?\):' AND a modified services/openshell-sandbox/environments/Dockerfile.<slug> path. Both conditions must hold per push. Slug is extracted via c.message.split('(')[1].split(')')[0]. Commit message typos like env(code-eval): will silently skip the build with no visible error.
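The filter can be dry-run locally before pushing. This sketch assumes the pattern matches a literal parenthesized slug and that the modified Dockerfile must be the one for the same slug; the function name is hypothetical and the real check runs as an EventListener filter, not Python.

```python
import re
from typing import Iterable, Optional

ENV_SUBJECT = re.compile(r"^environment\(.+?\):")
DOCKERFILE_PREFIX = "services/openshell-sandbox/environments/Dockerfile."


def environment_build_slug(message: str, modified: Iterable[str]) -> Optional[str]:
    """Return the slug if this commit would pass both conditions, else None."""
    if not ENV_SUBJECT.search(message):
        return None  # e.g. "env(code-eval): ..." is skipped silently
    # Same extraction as described above: text between the first "(" and ")".
    slug = message.split("(")[1].split(")")[0]
    if DOCKERFILE_PREFIX + slug not in set(modified):
        return None
    return slug
```

Running it over the commit subject and `git diff --name-only` output before pushing catches the typo case that otherwise produces no visible error.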
ActivePieces piece MCP URLs should not include port :3100 when targeting Knative. The container listens on 3100, but callers hit the cluster-local Knative Service URL. Stale agentConfig.mcpServers or workflow configs containing http://ap-...svc.cluster.local:3100/mcp bypass Knative routing and can leave agents silent.
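One way to clean stale configs is to drop the container port from any URL that carries it. This is a sketch only: the function name is hypothetical, and the hostname in the usage note is an invented example of the `ap-<piece>-service` pattern.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_container_port(url: str, container_port: int = 3100) -> str:
    """Remove an explicit :3100 so the request goes through Knative routing."""
    parts = urlsplit(url)
    if parts.port != container_port:
        return url  # no port, or a different one: leave untouched
    host = parts.hostname or ""
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For example, `http://ap-example-service.workflow-builder.svc.cluster.local:3100/mcp` becomes `http://ap-example-service.workflow-builder.svc.cluster.local/mcp`.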
MCP auth is request-scoped by connection external ID. For piece MCP tools, the runtime sends X-Connection-External-Id; piece-mcp-server calls workflow-builder's internal decrypt API. Do not put OAuth tokens, decrypted credentials, or user-specific secrets into KService env, workflow JSON, or GitOps manifests. The reconciler may set a fallback CONNECTION_EXTERNAL_ID, but per-request headers are the correct multi-user path.
ActivePieces piece-runtime services are generated from DB state by an all-catalog reconciler. The activepieces-mcps reconciler (every 2 min) provisions every catalog piece's ap-<piece>-service from enabled mcp_connection rows + PINNED_PIECES, so new pieces are automatic — no manual per-piece add. Each ap-<piece>-service is the converged piece-runtime (one piece-mcp-server image, PIECE_NAME env) serving /execute (deterministic activities — replacing the deleted fn-activepieces) + /mcp + /options + /health. The set pinned (or workflow-referenced) is held at minScale=1; the rest scale to zero. If a user adds Outlook/Excel/OneDrive and the KService is missing, debug activepieces-mcp-reconciler before patching workloads by hand.
Piece-runtime KServices scale to zero by design. knative-serving must allow allow-zero-initial-scale: "true", and scale-to-zero services use initialScale: "0". Cold starts can make the first /health, /mcp, or /execute probe exceed a short timeout; retry with a longer timeout before treating it as a hard failure.
The converged piece-runtime image needs NODE_OPTIONS=--max-old-space-size=400 + a 512Mi memory limit. The piece-mcp-server image OOM-kills at MODULE LOAD under a 256Mi/384Mi limit (loading all 42 AP piece packages). If ap-<piece>-service pods CrashLoop / OOMKill on startup, verify the generated KService carries NODE_OPTIONS=--max-old-space-size=400 and a 512Mi limit before suspecting the piece code.
Dapr durable protocol compatibility depends on a single actor state store per sidecar. Current workflow-builder expects workflowstatestore to be the only actorStateStore=true Component visible in the namespace. dapr-agent-py-statestore must stay actorStateStore=false; it stores agent application state, not workflow actor state. If agent sessions hang after a runtime change, verify Component metadata before restarting pods or clearing state.
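The Component metadata check can be scripted. The sketch below assumes the standard Dapr Component shape (`spec.metadata` as a list of name/value pairs) and takes already-parsed Component objects, for example the `items` of a `kubectl get components.dapr.io -o json` listing; the function name is hypothetical.

```python
from typing import Any, Dict, List


def actor_state_stores(components: List[Dict[str, Any]]) -> List[str]:
    """Names of Components that declare actorStateStore=true."""
    names = []
    for comp in components:
        for item in comp.get("spec", {}).get("metadata", []):
            is_flag = item.get("name") == "actorStateStore"
            if is_flag and str(item.get("value")).lower() == "true":
                names.append(comp["metadata"]["name"])
    return names
```

The expected result for a healthy namespace is exactly `["workflowstatestore"]`; any second name is the thing to fix before restarting pods.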
Dapr workflow cleanup is a lifecycle, not an instant delete — and it's now automated + vetted. Termination requests can return before a workflow is terminal. Every user-facing stop routes through ONE Lifecycle Controller (src/lib/server/lifecycle/{index,cascade,resolvers,reaper,ownership}.ts, stopDurableRun(target, {mode: interrupt|terminate|purge|reset})) which fans out terminate/purge explicitly per per-session app-id (the native Dapr recursive cascade doesn't cross task hubs; the orchestrator's old terminate_durable_runs_by_parent_execution was retired). It terminates/polls/purges per-session session+turn workflows first, then the parent, then reaps Sandbox CRs and flips DB rows terminal. Request/confirm (not one-shot fail-closed): a stop that can't confirm in-request returns HTTP 202 "stopping" + persists a stop_requested_at intent, and only flips DB / reaps once Dapr is confirmed terminal (via the GET …/stop/status poll + the reaper); 409 only on a genuine failure or coordinator_owned. A durable/run parent wedged awaiting a cross-app child (a sub-orchestration on a separate per-session task hub Dapr's recursive terminate can't reach) is force-finalized by confirmDurableStop after a grace once its child session is DB-terminated — this is exactly why the reaper/confirmDurableStop exist (the cross-app dispatch call_child_workflow was KEPT; fire-and-poll dispatch was tried and reverted). Two GitOps safety nets back it: the lifecycle-terminal-reaper CronJob (POST /api/internal/lifecycle/reap-terminal, reconciles stuck DB rows vs terminal/gone Dapr instances — it reconciles even during benchmark activity post-#69; only its aged-stuck pass defers to an execution owned by a still-active coordinator run) and the workflow-builder-sandbox-gc CronJob (age-based GC of orphaned per-session agent-host Sandbox CRs in the workflow-builder namespace, excludes SandboxWarmPool-owned). 
Dapr stateRetentionPolicy is unified at 168h across the parent (workflow-orchestrator-no-tracing) and the per-session child Configs (workflow-builder-agent-runtime, openshell-sandbox-dapr) — closing the old 168h-vs-30m split-brain that auto-purged children before the parent finished. A guarded, dry-run-by-default runbooks/phase0-lifecycle-clean-slate.{sh,md} (stacks workflow-builder component) is the one-time bulk-purge; it is NOT auto-run. If cleanup cannot prove closure, leave leases/sandboxes in place for a retry rather than creating invisible running workflows with missing workspaces.
Dapr sidecar liveness can stay green while readiness is permanently false. After placement/scheduler restarts or cert churn, workflow-builder runtime pods can show 1/2 because the app container is healthy but daprd returns ERR_HEALTH_NOT_READY: [grpc-api-server grpc-internal-server]. Logs often include Actor runtime shutting down, Placement client shutting down, or Workflow engine stopped. Verify dapr-system is currently healthy, then recycle the affected Deployment (openshell-agent-runtime, swebench-coordinator, or another Dapr-enabled runtime). See runbooks/debug-dapr-sidecar-stale-readiness.md.
Workflow JSON specs do not flow through image rebuilds. services/<agent>/<name>.workflow.json is excluded from the production Dockerfile copy list. Editing it in the repo + rebuilding doesn't change runtime behavior; the spoke's workflows.spec JSONB column is read at execution time. Updating the spec requires a DB UPDATE on each spoke. See runbooks/upsert-workflow-json.md.
ArgoCD SSA validation blocks parent-syncs-child-Application apply. When the parent app (e.g., spoke-dev-workflow-builder) tries to apply a kustomize-patched child Application (e.g., dev-browserstation), you may see Application.argoproj.io "<child>" is invalid: status.sync.comparedTo.source.repoURL: Required value. The parent's SSA payload nullifies a status field the CRD validator requires. Workaround: patch the live child directly with kubectl patch app dev-<name> --type=json -p='[{"op":"replace","path":"/spec/source/kustomize/images/0","value":"..."}]' (the patch body must be valid JSON, so quote every key and string). The parent will keep retrying with the failing apply but the child's live spec is correct.
ArgoCD 3.4.x ClientSideApplyMigration wedges large CRDs (argo-cd#26279). Before doing SSA, ArgoCD 3.4.x runs a ClientSideApplyMigration step when the live object is not yet argocd-controller-owned. That intermediate client-side apply writes a last-applied-configuration annotation; for very large objects (the ~1.4MB workloads.kueue.x-k8s.io CRD) the annotation exceeds the 262144-byte etcd annotation limit and the sync wedges. Triggered on ryzen because the CRD had been hand-applied with kubectl during recovery, so kubectl co-owns it (managedFields owners = kubectl, argocd-controller, kube-apiserver, kueue). Fix is a ryzen-only overlay patch adding ClientSideApplyMigration=false to the kueue Application's syncOptions (packages/overlays/ryzen/kustomization.yaml ~line 261) — pure SSA, clean ownership transfer, no Workload CR data loss. Keep it while kubectl co-owns the CRD (harmless no-op on a clean recreate). hub/dev/staging never hit this (argocd-controller owned the CRD from the start).
A kustomize RFC6902 op: add /spec/source/kustomize REPLACES the whole node (last-writer-wins). When two co-located patch blocks both op: add to the same path (e.g. both packages/components/profiles/local-core-ryzen AND packages/overlays/ryzen patch the tailscale-operator Application's /spec/source/kustomize), the later block clobbers the earlier one entirely — you do not get a merge. The overlay runs after the component, so the overlay block wins; anything that must survive (e.g. the gitea-tailscale-backend Service $patch: delete that stops ryzen syncs failing namespaces gitea not found) must be co-located inside the winning overlay block, not split into the component. This clobber rule governs every co-located op: add between local-core-ryzen and overlays/ryzen.
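The clobber follows directly from RFC 6902, where `add` on an existing object member replaces its value. The sketch below is a deliberately minimal `add` for object members only (no arrays), written to show the effect; the patch contents are placeholders, not the real component or overlay blocks.

```python
import copy
from typing import Any, Dict


def json_patch_add(doc: Dict[str, Any], path: str, value: Any) -> Dict[str, Any]:
    """Minimal RFC 6902 'add' for object members; returns a patched copy."""
    result = copy.deepcopy(doc)
    keys = [k.replace("~1", "/").replace("~0", "~") for k in path.lstrip("/").split("/")]
    node = result
    for key in keys[:-1]:
        node = node[key]
    node[keys[-1]] = copy.deepcopy(value)  # existing member is replaced, not merged
    return result


app = {"spec": {"source": {"path": "example"}}}
# Component block runs first, overlay block second; same target path.
after_component = json_patch_add(app, "/spec/source/kustomize", {"patches": ["from-component"]})
after_overlay = json_patch_add(after_component, "/spec/source/kustomize", {"images": ["from-overlay"]})
```

After both patches `after_overlay["spec"]["source"]["kustomize"]` holds only the overlay's keys, which is why anything that must survive has to live in the overlay block.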
KubeRay head pod doesn't auto-roll on image change. When a RayCluster spec image is bumped via kustomize.images, the KubeRay operator gradually rolls workers but the head stays on the old image until explicitly deleted (kubectl delete pod -n ray-system browserstation-head-<id>). Workers wait on head GCS via wait-gcs-ready init container, so a stuck old head blocks worker rollout too. Verify with kubectl get pod -n ray-system -l ray.io/cluster=browserstation -o jsonpath='{range .items[*]}{.metadata.name} {.spec.containers[?(@.name=="ray-head")].image}{.spec.containers[?(@.name=="ray-worker")].image}{"\n"}{end}'.
Buildah short-name resolution is enforced in noninteractive Tekton builds. FROM rayproject/ray:2.47.1-cpu fails with short-name resolution enforced but cannot prompt without a TTY. Always fully-qualify base images (docker.io/rayproject/...). Fix is in the Dockerfile, not the pipeline.
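A quick pre-push check for short-name base images might look like the sketch below. It is a heuristic, not Buildah's resolver: it treats a reference as fully qualified when its first path component looks like a registry host (contains a dot or a port, or is `localhost`), and it ignores `scratch`, earlier build stages, and `$ARG` references. The function name is hypothetical.

```python
from typing import List


def short_name_bases(dockerfile_text: str) -> List[str]:
    """Return FROM references that lack an explicit registry host."""
    stages = set()
    offenders = []
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0].upper() != "FROM":
            continue
        args = [t for t in tokens[1:] if not t.startswith("--")]  # drop --platform=...
        if not args:
            continue
        image = args[0]
        registry = image.split("/")[0]
        qualified = "/" in image and ("." in registry or ":" in registry or registry == "localhost")
        skip = image == "scratch" or image in stages or image.startswith("$")
        if not qualified and not skip:
            offenders.append(image)
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2])
    return offenders
```

A non-empty result is the list of lines to rewrite with an explicit registry such as `docker.io/`.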
Unpinned pnpm → v10 fails the prod build with ERR_PNPM_IGNORED_BUILDS (wfb PR #42). A Node service whose Dockerfile does npm install -g pnpm (unpinned) gets pnpm v10, which FAILS at RUN pnpm build with ERR_PNPM_IGNORED_BUILDS — esbuild/protobufjs build scripts are blocked behind an approval gate. FIX: pin pnpm@9 (like mcp-gateway); do NOT rely on --ignore-scripts (leaves esbuild's binary missing for the build stage). KEY INSIGHT: such a prod-build break can HIDE for weeks/indefinitely because the per-service trigger only fires on a services/<svc>/ change — the image just stays frozen at the last successful build. (function-router was stuck at a May-21 image for exactly this reason; recover via the build-to-current PipelineRun technique once pnpm@9 is pinned.)
Active skaffold dev sessions can mask declarative image rollout. The ryzen Application can be Synced/Healthy and the declarative Deployment image can point at the new tag while a Skaffold-owned dev pod serves live traffic from synced source (ArgoCD paused via skip-reconcile). Verify the actual serving pod, image, and synced files before declaring ryzen done.
Skaffold-owned dev pods cache stale env vars across ArgoCD updates. A subtle variant of the above: when ArgoCD bumps an env var on Deployment-workflow-builder.yaml (e.g. AGENT_RUNTIME_DEFAULT_IMAGE), the Deployment+ReplicaSet roll, but the long-lived workflow-builder-dev-* pod was created hours/days earlier and won't restart on its own. The serving pod reads the OLD env value, so the BFF keeps spawning per-session Sandbox pods on the stale runtime image despite the manifest being "synced". Diagnose: kubectl get deploy workflow-builder -o jsonpath='{...env...}' shows the new value but kubectl exec deploy/workflow-builder -- printenv AGENT_RUNTIME_DEFAULT_IMAGE still shows the old one. Recovery: exit skaffold dev (which removes the dev override) OR kubectl delete pod workflow-builder-dev-* (forces a fresh pod from the current Deployment template). After either, verify the standard workflow-builder-* pod has the expected env and launch a fresh session to confirm the new runtime image.
The runtime image's single source of truth is the BFF env var, read per-session. There is no AgentRuntime CR to patch and no revert loop — the CRD + Kopf controller are retired. To roll the runtime forward durably: (1) bump the AGENT_RUNTIME_*_DEFAULT_IMAGE env var in the stacks Deployment YAML AND (2) verify the BFF pod (not just the manifest) sees the new value (kubectl exec deploy/workflow-builder -- printenv ...), then (3) launch a fresh session — the next agent-sandbox Sandbox pod pulls the new image. (For the static dapr-agent-py pool, roll its Deployment instead.) See runbooks/bump-image-pin-not-in-release-pins.md.
rayproject/ray:2.47.1-cpu ships Python 3.9. PEP-604 union syntax (def f(x: float | None)) fails at module import with TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'. Add from __future__ import annotations at the top of the file or use Optional[X]. Caused a head-pod CrashLoopBackOff on the Tier 2 browserstation rollout.
A release-pinned tag missing on GHCR means the outer-loop build didn't run. The local Gitea image registry on ryzen is retired (May 2026) and all clusters pull from ghcr.io/pittampalliorg/*, so the old gitea→ghcr mirror runbook was removed (its source registry no longer exists). When a git-<sha> tag is absent from GHCR, rebuild it from the source commit via the GitHub outer-loop (or a manual build-and-push to GHCR using hub ghcr-push-credentials); see runbooks/debug-funnel-orphan-tag.md for the webhook/EventListener failure that suppresses the build. The github-outer-loop EL has per-service triggers for ALL Skaffold-owned services (workflow-builder, workflow-orchestrator, function-router, mcp-gateway, swebench-coordinator — verified end-to-end 2026-06-05), each firing on a services/<svc>/ change. But because a trigger ONLY fires on a source change to its own service, a service that has had no services/<svc>/ push stays frozen at its last successful build — use the build-to-current PipelineRun technique (see the Outer-loop lane bullet) to re-pin it from current main HEAD without a source change.
Tailscale Funnel orphan tags silently break webhooks. If a tag is removed from policy.hujson but a device still uses it, the operator pod claims "Funnel on" locally but the control plane revokes the cap. Public DNS goes NXDOMAIN. Diagnostic: tailscale status --json | jq '.Self | {Tags, CapMap}' from inside the proxy pod.
ProxyGroup service-host tags are separate from Funnel tags. For hub browser VIPs, the Ingress tailscale.com/tags, Tailscale Service tags, policy.hujson autoApprovers.services, and the authenticated ProxyGroup pod tag must agree. Hub cluster-ingress should authenticate as tag:k8s-services; tag:k8s is legacy compatibility only.
Device-backed Tailscale Ingresses are not svc:* service-hosts. Promoted-spoke app URLs without tailscale.com/proxy-group (e.g. phoenix-*) register as Tailscale devices, usually tagged tag:k8s. Do not add autoApprovers.services["svc:<hostname>"] for these. A stale svc:<hostname> record can reserve the canonical DNS name and force the real device to register as <hostname>-1.
workflow-builder web exposure = Tailscale L4 LoadBalancer + in-cluster TLS, NOT an Ingress, NO Let's Encrypt (PR #2319). workflow-builder is reached at https://workflow-builder-{dev,ryzen,staging}.tail286401.ts.net via a Tailscale L4 LoadBalancer Service (type: LoadBalancer, loadBalancerClass: tailscale, tailscale.com/hostname, 443→https-tls) whose HTTPS is terminated by a per-pod nginx tls-terminator sidecar serving a persistent self-signed wildcard *.tail286401.ts.net. The old ingressClassName: tailscale + LE development-prod-cert exposure (and ryzen's brief plain-HTTP LB, #2314/#2316) is retired — recreate churn exhausted LE's 5-certs/168h limit (429 → unreachable). The wildcard is signed by the tailnet-dev-ca ClusterIssuer, restored from KV (TAILNET-DEV-CA-CRT/-KEY) via hub ns spoke-secrets Secret tailnet-ca → spoke base app packages/components/tailnet-ca. The LB Service registers as a tag:k8s device (same ACL rule as device-backed Ingresses); it is NOT a svc:* service. mcp-gateway was dropped from the tailnet — MCP_GATEWAY_BASE_URL=http://mcp-gateway.workflow-builder.svc.cluster.local:8080; ORIGIN/APP_PUBLIC_URL stay https://workflow-builder-<cluster>.... Manifests: Service-workflow-builder-tailnet.yaml (dev/staging) / workflow-builder-tailnet-lb/ (ryzen) + the sidecar ConfigMap/Certificate under workflow-builder/manifests/. See reference/access-paths.md and reference/architecture.md.
workflow-builder-<cluster> 502 for browsers but 302 for curl (PR #2327). The tls-terminator nginx default 8k proxy header buffer overflows on SvelteKit auth's large Set-Cookie headers, so browsers get 502 while bare curl (small headers) returns 302 — masking it. Fix lives in the sidecar ConfigMap-workflow-builder-tls-terminator.yaml (proxy_buffer_size 32k; proxy_buffers 8 32k; proxy_busy_buffers_size 64k; large_client_header_buffers 4 32k). Verify HTTPS exposure with a REAL browser (or curl with full browser headers), and diagnose via the sidecar nginx error log. Opening the URLs cleanly also needs workstation trust of the "PittampalliOrg Tailnet Dev CA" (nixos-config 44ba6324; the Chrome NSS seed is required on NixOS because security.pki doesn't cover Chrome).
gitea-registry-creds imagePullSecret is RETIRED fleet-wide (PR #2317) — do not re-add it. It was a dead reference (the secret was never produced on any cluster); PR #2317 removed it from 23 manifests + 2 SAs. All images pull from ghcr.io/pittampalliorg/* via ghcr-pull-credentials. (deployment/scripts/trigger-tekton-builds.sh keeps a same-named build-side PUSH credential — different, intentionally kept.)
Stale tailnet devices: gated pre-recreate script + the hub sweeper backstop (PRs #2322/#2325). The hard on-recreate guarantee against <hostname>-N collisions is the gated deployment/scripts/cleanup-tailnet-devices.sh run pre-recreate. As a hygiene backstop, hub CronJob tailnet-device-sweeper (ns tailscale, every 15m) deletes OFFLINE stale spoke devices (lastSeen > 30m, best-effort). API gotcha: the device hostname field DROPS the -N suffix (a live device and its dead -N twin share one hostname) — match on the MagicDNS name; lastSeen IS a reliable liveness signal. An in-Composition pre-onboarding cleanup was deliberately NOT built (a function-pipeline error would halt ALL spoke provisioning).
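The matching rule can be sketched as follows. This is not the sweeper's code: it assumes device records shaped like the Tailscale API device listing, with `name` as the MagicDNS name, `hostname` as the suffix-less hostname, and `lastSeen` as an ISO-8601 UTC timestamp. The device entries in the usage note are invented.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def stale_devices(devices: List[Dict[str, Any]], now: datetime,
                  max_age: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return MagicDNS names of devices not seen within max_age."""
    stale = []
    for device in devices:
        last_seen = datetime.fromisoformat(device["lastSeen"].replace("Z", "+00:00"))
        if now - last_seen > max_age:
            # Key on the MagicDNS name: hostname is identical for a live
            # device and its -N twin, so it cannot tell them apart.
            stale.append(device["name"])
    return stale
```

Given a live `phoenix-dev.tail286401.ts.net` and a dead `phoenix-dev-1.tail286401.ts.net` that both report hostname `phoenix-dev`, only the second name is returned.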
env/hub-next can go MISSING after a hub promotion PR merges. GitOps Promoter logs ChangeTransferPolicyNotReady "couldn't find remote ref env/hub-next" and PromotionStrategy stacks-environments goes NotReady (flooding warning events). It is NOT GitHub auto-delete (delete_branch_on_merge=false); only env/hub-next is affected (spoke -next branches self-heal via their busy hydrators; the idle hub hydrator doesn't recreate it). When active == proposed dry SHA (no pending hub change), recreate it: git push origin origin/env/hub:refs/heads/env/hub-next; the Promoter reconciles to Ready. See runbooks/manage-gitops-promoter.md.
Tailscale operator Secret metadata matters after manual recovery. Ingress proxy state Secrets for device-backed Ingresses (e.g. tailscale/ts-phoenix-tailscale-*-0) must keep labels tailscale.com/managed=true, tailscale.com/parent-resource=<ingress>, tailscale.com/parent-resource-ns=<namespace>, and tailscale.com/parent-resource-type=ingress. If a manual auth/key repair leaves a huge kubectl.kubernetes.io/last-applied-configuration annotation or strips labels, the endpoint may work while ArgoCD stays Progressing; restore labels and remove the stale annotation.
Tailscale egress targets nodes, not service-host VIPs. gitops-inventory-hub.tail286401.ts.net is a service-host VIP backed by cluster-ingress; egressing to it produces "node not found", ECONNREFUSED, or timeouts. Spoke inventory fetches must use gitops-inventory-hub-node.tail286401.ts.net:8080 through gitops-inventory-hub-egress.tailscale.svc.cluster.local:8080.
Tailscale operator mutates egress Services. It writes /spec/externalName and may add /spec/ports/0/targetPort; Argo Applications that own egress Services should ignore both fields or they will stay OutOfSync despite working traffic.
Dev/staging service URLs must be declared, not inferred from ryzen. Phoenix URLs belong in the spoke ApplicationSet template with {{cluster}}, with matching phoenix-* Tailscale Ingresses. MCP_GATEWAY_BASE_URL is now the in-cluster http://mcp-gateway.workflow-builder.svc.cluster.local:8080 (mcp-gateway dropped from the tailnet, PR #2319 — no mcp-gateway-* Ingress). Add policy.hujson svc:* approvals only when the hostname is actually served by a ProxyGroup/Tailscale Service.
Spoke API VIPs need both service approval and Kubernetes grants. dev-api-v2/staging-api-v2 service-hosts need autoApprovers.services entries, and the authenticated tag:spoke-api devices need a Kubernetes impersonation grant to tag:k8s with system:masters. Re-authenticate the spoke ProxyGroup after ACL changes if the device still has stale caps.
ESO refresh ↔ pod restart race. When rotating a KeyVault secret, ESO may not finish writing the K8s Secret before a Deployment restart kicks off. The new pod reads the stale value. Always verify the K8s Secret head matches the new value before triggering the restart.
GitHub repo picker missing org repos = the per-cluster OAuth app lacks the org grant — dev and ryzen are SEPARATE GitHub OAuth apps. Workflow Connections (Dev) (Ov23linctlmmlA9F8odt) and Workflow Connections (Ryzen) (Ov23liqsg0KjlK52R2at) each need PittampalliOrg access granted in the GitHub org's OAuth-app settings; the grant applies LIVE to existing tokens (no re-auth, no secret rotation). The client_id a cluster uses comes from the platform_oauth_apps DB table (seeded by sync-oauth-apps), not an env var — check there before assuming a config drift. Separately, /api/scm/repos paginates since wfb PR #89 (5 pages / 500 repos) — before that a single-page fetch silently truncated the picker at 100 repos, which mimics a missing-grant symptom.
Hub→spoke kube-API reach is RETIRED as the ArgoCD sync path (argocd-agent v0.8.1). Each spoke reconciles its own apps locally; the agent dials the hub principal OUTBOUND (8443). The cluster-<spoke> Secret is an AGENT MAPPING (server=https://argocd-agent-resource-proxy:9090?agentName=<spoke> + embedded mTLS, NO bearerToken), not a kube-API endpoint. The legacy hub→ryzen apiserver-proxy SNI path (operator v1.92.4+ strictly validates the wire SNI; cluster-ryzen server: https://ryzen-operator... + hub CoreDNS rewrite to ryzen-api-egress; the stale ryzen-operator-1 device + curl --connect-to verify) and the ryzen host raw-TCP tailscale serve --tcp=6443 passthrough now exist ONLY for Headlamp (the host-passthrough endpoint lives in the dedicated headlamp.dev/cluster=true Secret, separate from the agent mapping) — NOT for ArgoCD sync. The ryzen→hub ESO secret-fetch transport (a different device + RYZEN-side CoreDNS rewrite) is also unaffected. See reference/architecture.md. Don't use MagicDNS directly from hub pods — it fails or hangs.
Spokes are registered by an Argo CD cluster Secret + two appsets. The cluster Secret (argocd.argoproj.io/secret-type: cluster + stacks.io/{hub-managed,cluster-role,platform} labels + spoke-cluster/stacks.io/source-branch annotations) in hub argocd ns is the contract. spoke-clusters-appset.yaml (clusters generator) templates the root spoke-<name> Application (path packages/overlays/<name>, targetRevision from the source-branch annotation); spoke-workloads-appset.yaml adds the workload.stacks.io/workflow-builder=true selector and templates spoke-<name>-workflow-builder. ryzen OMITS the workflow-builder label on purpose (its overlay composes workflow-builder-system directly), so only spoke-clusters generates for ryzen. dev/staging cluster Secrets are minted by Crossplane onboarding; cluster-ryzen is a STATIC GitOps-delivered Secret. GOTCHA: there is a SECOND, unused spoke-clusters-appset.yaml under packages/components/hub-base/apps/ that hardcodes targetRevision: main + has the empty-kustomize: {} hydrator-stall trap — the hub uses only the hub-spoke-appsets copy. Edit the hub-spoke-appsets copy. See reference/architecture.md.
Hub build nodes are the default build capacity. The current hub baseline is three cpx41 control/management nodes plus two tainted ccx33 build workers (stacks.io/build-pool=hub, upgradeable to ccx43). Do not remove the build-node taint to "fix" scheduling; add the node selector/toleration to the PipelineRun template.
ProxyGroup auth must target the intended context. kubectl --kubeconfig ~/.kube/config does not select a cluster by itself; it still uses that file's current context. For dev/staging/hub repairs, minify the intended context into a temporary kubeconfig or set KUBECONFIG to the Crossplane fallback kubeconfig before running deployment/scripts/tailscale/proxygroup-auth.sh. For kube-apiserver ProxyGroups, the script patches the *-config secret, not just TS_AUTHKEY env.
Ryzen reconciles main DIRECTLY; pushing to main IS how content reaches ryzen. The inner-loop branch is RETIRED (deleted). Ryzen's LOCAL ArgoCD root-ryzen tracks packages/overlays/ryzen @ main: commit-pin.sh (and any manifest edit) just commits to main, and ryzen reconciles on its next poll; to force an immediate re-compare, run deployment/scripts/ryzen-sync.sh (hard-refreshes root-ryzen). There is NO env/spokes-ryzen, NO source-hydrator, and NO Promoter on the ryzen lane, so the empty-drySource.kustomize hydrator-stall bug does NOT apply. If ryzen looks frozen, hard-refresh root-ryzen; don't look for an inner-loop advance.
Manual branch reconciliation is not part of the normal ryzen loop. Ryzen-related Applications source GitHub main directly via the local root-ryzen; there is no separate ryzen branch to keep in sync.
argocd-hub.tail286401.ts.net works even when other Tailscale ProxyGroups are down. It's an independent ProxyGroup. When per-spoke Tailscale access is broken, you can still drive ArgoCD ops from the hub via argocd login argocd-hub.tail286401.ts.net --grpc-web.
GitOps Promoter app releases may be newer than the Helm chart appVersion. Verify both upstream release and Helm chart metadata. As of 2026-04-24, the controller runs v0.27.1; the latest Helm chart is 0.6.0 with appVersion: 0.26.2, so stacks keeps chart 0.6.0 and overrides manager.image.tag.
Promoter UI patch hooks need a shell-capable kubectl image. registry.k8s.io/kubectl is distroless and has no /bin/sh; use alpine/k8s:<version> for shell-scripted hook jobs. The ArgoCD Helm chart's server container is named server, even though the Deployment is argocd-server.
Hub source-hydrator status can pin a stale dry SHA. If root-application.status.sourceHydrator.currentOperation.drySHA stays behind origin/main, remove currentOperation and lastSuccessfulOperation from status and hard-refresh the app. See runbooks/manage-gitops-promoter.md.
Drizzle Kit silently skips SQL files lacking _journal.json entries. The db-migrate Sync hook on dev/staging runs npx drizzle-kit migrate, which globs drizzle/*.sql BUT only applies files with a matching entries[] tag in drizzle/meta/_journal.json. Job exits 0 either way — easy to miss. Always update the journal when adding a migration; older files in the repo (0006/0007/0020/0032/0037-0043) lack journal entries because their columns were applied via out-of-band paths historically. Prompt Workbench's resource_prompt_versions table is one of the checks to run after prompt-preset deploys. See runbooks/fix-drizzle-migration.md.
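Since the Job exits 0 either way, a pre-merge check for unjournaled files is cheap insurance. The sketch assumes the layout described above (`drizzle/*.sql` plus `drizzle/meta/_journal.json` with an `entries[]` list whose `tag` equals the SQL file name without extension); the function name is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def unjournaled_migrations(drizzle_dir: str) -> List[str]:
    """Return SQL file stems that have no matching tag in the journal."""
    root = Path(drizzle_dir)
    journal = json.loads((root / "meta" / "_journal.json").read_text())
    tags = {entry["tag"] for entry in journal.get("entries", [])}
    return sorted(p.stem for p in root.glob("*.sql") if p.stem not in tags)
```

On this repo the historical files listed above would be reported too, so compare the output against that known set and treat only new names as a problem.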
Two migration runners read from two different directories. src/lib/server/startup.ts reads from atlas/migrations/ (timestamp-prefixed); npx drizzle-kit migrate reads from drizzle/ (incremental + journal-gated). The production image's Dockerfile copies drizzle/ but .dockerignore excludes atlas/, so the atlas-runner is effectively only active in the ryzen Skaffold dev pod (which file-syncs source). New migrations usually need to live in BOTH dirs, both idempotent (ADD COLUMN IF NOT EXISTS).
Source-hydrator polls every ~3 min. After release metadata lands on origin/main, expect 5-8 min before dev's pod is rolling on the new image, then staging waits for its configured soak timer after health. argocd app refresh --hard triggers manifest re-render but does NOT immediately repoll branch tips. argocd app sync --revision <sha> is rejected on auto-sync + branch-tracking apps (Cannot sync to <sha>: auto-sync currently set to <branch>). Don't hard-sync; wait. See runbooks/track-promotion-state.md for what's-actually-stuck triage.
Generated env/spokes-* branches need guardrails. If the generated app directory drifts from env/spokes-*-next, use scripts/gitops/reconcile-spoke-generated-dir.sh <dev|staging> check|fix; do not hand-edit generated env branches unless the script proves the root and child dry SHAs match.
git status --porcelain R prefix means "renamed in INDEX, already staged". Filtering with grep -E "^A |^M " MISSES it. After a stale git add or interrupted commit, your next git commit will scoop in any pre-staged renames/deletes alongside what you intended. Before committing, either git reset HEAD -- to clear the index then re-stage exact paths, or use git diff --cached --name-status (which shows ALL staged changes including renames + deletes + mode changes).
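To see what the narrow grep misses, the sketch below compares it with a filter on the index column of porcelain v1 output (the first status character). The function names and sample paths are illustrative only.

```python
import re
from typing import List

NARROW = re.compile(r"^A |^M ")  # the filter that misses renames and deletes


def staged_entries(porcelain: str) -> List[str]:
    """Lines whose index column is set, i.e. what the next commit will include."""
    return [ln for ln in porcelain.splitlines() if ln and ln[0] not in (" ", "?", "!")]


def narrow_entries(porcelain: str) -> List[str]:
    return [ln for ln in porcelain.splitlines() if NARROW.search(ln)]


sample = "\n".join([
    "M  a.py",
    "R  old.py -> new.py",
    "D  gone.py",
    " M unstaged.py",
    "?? scratch.txt",
])
```

On `sample`, `narrow_entries` reports one path while `staged_entries` reports three, and the difference is exactly the pre-staged rename and delete that would ride along in the commit.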
SWE-env build cache lock needs THREE layers, not one. Task-swebench-inference-image-build-push.yaml acquires /var/lib/containers/.swebench-buildah.lock via mkdir. The shell trap releases on graceful exit but SIGKILL is uncatchable (OOMKill, eviction, controller force-delete bypass it). Tekton's retries:1 then spawns a retry pod that inherits the parent TaskRun name, so the dead pod's owner file says taskRun=<the same name> and the retry sees its own predecessor's lock as held by "another PipelineRun" → spin-polls forever → the PipelineRun never terminates → Pipeline finally: never runs → deadlock. Fix is committed (stacks 52bb0b18 + f6f4bb00 + d450c9b1); know the symptom: retry-pod logs say Buildah cache lock is held by: taskRun=<own name> and original pod is OOMKilled. The 3 layers: (1) self-takeover in acquire_buildah_cache_lock: if owner taskRun matches context.taskRun.name, remove + reacquire; (2) Pipeline finally: task release-buildah-cache-lock: removes lock if owner starts with $(context.pipelineRun.name)-; (3) BUILDAH_CACHE_LOCK_STALE_SECONDS=1800 (was 21600; 6h was way too long). Also: build-and-push memory bumped 2Gi req → 4Gi req / 6Gi limit so OOMKills become rare AND when they happen the kernel kills the offending container only (predictable). If a stale lock recurs, manual clear: spin a busybox pod with the buildah-cache-swebench-inference PVC mounted and rm -rf /cache/.swebench-buildah.lock.
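The three layers can be modelled in a few lines. This is a Python sketch of the idea, not the Task's shell code: the owner-file format (`<taskRun> <epoch>` in a file named `owner`), both function names, and the example run names are assumptions, and an unreadable owner file is conservatively treated as a held lock.

```python
import os
import shutil
import time
from typing import Optional


def acquire_lock(lock_dir: str, task_run: str, stale_seconds: float = 1800,
                 now: Optional[float] = None) -> bool:
    """mkdir-based lock with self-takeover (layer 1) and staleness (layer 3)."""
    now = time.time() if now is None else now
    owner_file = os.path.join(lock_dir, "owner")
    try:
        os.mkdir(lock_dir)
    except FileExistsError:
        try:
            with open(owner_file) as fh:
                owner, stamp = fh.read().split()
        except (FileNotFoundError, ValueError):
            return False  # owner unknown: treat as held
        if owner != task_run and now - float(stamp) < stale_seconds:
            return False  # genuinely held by someone else and still fresh
        shutil.rmtree(lock_dir, ignore_errors=True)
        try:
            os.mkdir(lock_dir)
        except FileExistsError:
            return False  # lost the race to another taker
    with open(owner_file, "w") as fh:
        fh.write(f"{task_run} {now}")
    return True


def release_for_pipeline_run(lock_dir: str, pipeline_run: str) -> bool:
    """Layer 2: a finally-style release keyed on the PipelineRun name prefix."""
    try:
        with open(os.path.join(lock_dir, "owner")) as fh:
            owner = fh.read().split()[0]
    except (FileNotFoundError, IndexError):
        return False
    if not owner.startswith(pipeline_run + "-"):
        return False
    shutil.rmtree(lock_dir, ignore_errors=True)
    return True
```

The self-takeover branch is what breaks the deadlock: a retry pod that carries its predecessor's TaskRun name reclaims the lock instead of polling on it.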
K8s label values must match (([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])? — alphanumeric start AND end. Nanoid-generated IDs (workflow-builder benchmark runs, etc.) can legally end in _ or -. When a service uses run["id"] directly as a label value (e.g. swebench-coordinator's evaluator Job), the API rejects creation with HTTP 422 Invalid value: "<id>": a valid label must ... start and end with an alphanumeric character. Symptoms: run goes failed immediately at the evaluating stage with no useful inference error; harness eval never executes. Fix at the use site, not the ID generator (don't break existing data): trim outer [._-] characters and cap at 63 chars. Coordinator fix landed at workflow-builder 0f369b58 via _safe_label_value helper at lines 471/485 of services/swebench-coordinator/src/app.py. If you add a new service that labels K8s resources with run/instance IDs, mirror the helper.
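A sketch of such a use-site sanitizer in Python, assuming (as with nanoid IDs) that interior characters are already valid and only the outer characters and the length need fixing. The function name `safe_label_value` is illustrative; the body of the coordinator's real `_safe_label_value` helper is not shown here and may differ.

```python
import re

# Label-value grammar quoted in the text above.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")
MAX_LABEL_LEN = 63


def safe_label_value(raw: str) -> str:
    """Trim outer '.', '_' and '-' characters and cap the result at 63 chars.

    The second trim runs after truncation because cutting at 63 characters
    can itself leave a separator at the end.
    """
    trimmed = raw.strip("._-")
    return trimmed[:MAX_LABEL_LEN].rstrip("._-")


def is_valid_label_value(value: str) -> bool:
    """Check a value against the grammar and the 63-character limit."""
    return len(value) <= MAX_LABEL_LEN and LABEL_VALUE_RE.fullmatch(value) is not None
```

Because the helper only trims and truncates, two distinct IDs can in principle map to the same label value, so keep the untrimmed ID wherever an exact lookup is needed.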
SWE-bench random readiness is exact on the current environment identity. The Benchmarks page and POST /api/benchmarks/runs must use the same readiness logic as coordinator preflight. Static ConfigMap pins are ready when suite/repo/baseCommit/version and image digest match, even if environmentSetupCommit is absent. Dynamic DB build rows are ready only when environment_image_builds.env_spec_hash equals the current buildSwebenchEnvironmentSpec() hash. Do not fall back to repo/version/baseCommit-only DB matching; that admits old images and leaves runs parked in preflight while hub builds a new image.
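The two readiness rules can be sketched as predicates. This is an illustrative Python model, not the workflow-builder implementation: the dictionary shapes, the `imageDigest` key name, and the function names are assumptions; only `env_spec_hash` and the compared identity fields come from the text.

```python
from __future__ import annotations

# Identity fields compared for static ConfigMap pins. environmentSetupCommit
# is deliberately absent: a pin without it can still be ready.
STATIC_IDENTITY_KEYS = ("suite", "repo", "baseCommit", "version", "imageDigest")


def static_pin_ready(pin: dict, wanted: dict) -> bool:
    """A static pin is ready when every identity field is present and equal."""
    return all(
        pin.get(key) is not None and pin.get(key) == wanted.get(key)
        for key in STATIC_IDENTITY_KEYS
    )


def dynamic_build_ready(row: dict, current_env_spec_hash: str) -> bool:
    """A DB build row is ready only on an exact env_spec_hash match.

    There is no fallback to repo/version/baseCommit matching.
    """
    return bool(current_env_spec_hash) and row.get("env_spec_hash") == current_env_spec_hash
```

Sharing one pair of predicates between the Benchmarks page, the run-creation endpoint, and coordinator preflight is one way to keep the three code paths from drifting apart.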

Update (2026-05-19) — image delivery while the 2nd EL is still dead, + dev-portability of utility images

Ryzen sync is locally-driven. Ryzen's OWN argocd-application-controller reconciles root-ryzen @ main on its poll interval OR responds to refresh=hard annotations on the local ryzen-* Applications (--context admin@ryzen). Ryzen has its own local ArgoCD (autonomous agent); no local Gitea. Normal health target is every ryzen-* Application Synced/Healthy on ryzen's local ArgoCD.
Working ryzen image pattern: outer-loop GitHub lane builds + pushes ghcr.io/pittampalliorg/<img>:git-<sha>. To deliver to ryzen specifically, edit packages/components/workloads/<comp>/manifests/kustomization.yaml images: to newName: ghcr.io/pittampalliorg/<img> + newTag: git-<sha> and commit/merge to main (or use commit-pin.sh, which automates this and pushes to main). Ryzen's local ArgoCD re-renders overlays/ryzen @ main and rolls the pod. Ryzen Deployments already have the ghcr-pull-credentials imagePullSecret materialized by ESO from KV GITHUB-PAT. Exception since C1 (2026-06-04): workflow-builder + workflow-mcp-server have NO bare images: block. For those two, commit-pin.sh workflow-builder writes the flat release-pins/workflow-builder-images-ryzen.yaml AND renders + commits the workflow-builder-ryzen-image Component LOCALLY in the same push (wfb PR #37; stacks CI is just a drift-net); do not hand-edit a newTag for them (see the "two visible truths" gotcha).
Preserve workloads pins in git. Ryzen renders whatever workloads pin is committed to main. Image pins committed to main are the same pins dev/staging consume via their release-pins / Promoter outer-loop path.
dev-* apps source workloads/*/manifests/ @ origin/main HEAD (shared with ryzen, NOT a per-spoke render). dev-swebench-coordinator, dev-swebench-evaluator-tekton, dev-workflow-builder ArgoCD apps point at github.com/PittampalliOrg/stacks.git path packages/components/workloads/<comp>/manifests/ rev HEAD. So a commit to origin/main in workloads delivers to dev automatically; the spoke-workloads ApplicationSet additionally rewrites the release-pins workload images onto those apps' spec.source.kustomize.images (swebench-coordinator/evaluator/workflow-builder are release-pins keys — outer-loop update-stacks already bumps them). The base workloads pins are the ryzen value; dev's image comes from the release-pins override.
Utility/init images pinned to ryzen Gitea break dev/staging (Init:ImagePullBackOff). workloads/{swebench-coordinator,evaluation-coordinator}/manifests/kustomization.yaml rewrote bitnami/kubectl (+alpine/k8s) → gitea.cnoe.localtest.me:8443/giteaadmin/kubectl (ryzen's in-cluster Gitea, unreachable from dev/staging). The spoke ApplicationSet only rewrites release-pins workload images, NOT utility/init images, so those Deployments' wait-for-workflowstatestore init containers were permanently Init:ImagePullBackOff on dev/staging — they had never run there; only ryzen worked (local mirror). Fix (stacks #1707): mirror ryzen's kubectl image → ghcr.io/pittampalliorg/kubectl:latest (skopeo copy --src-tls-verify=false docker://gitea-ryzen.tail286401.ts.net/giteaadmin/kubectl:latest docker://ghcr.io/pittampalliorg/kubectl:latest, dest-auth = hub ghcr-push-credentials; as an agent use the SSH-wrapped ssh vpittamp@ryzen '…' form to dodge the bash-tool Production-Reads guard) and rewrite both kustomizations gitea→ghcr.io/pittampalliorg/kubectl (all-spoke; ryzen pulls GHCR fine). General rule: any workloads manifest that rewrites a utility image to gitea.cnoe.localtest.me:8443/... is dev/staging-broken by construction.
ArgoCD won't advance to a new commit despite autoSync/selfHeal? Symptom: app OutOfSync at the new rev, operationState.phase=Running … retrying attempt #N, a Deployment ProgressDeadlineExceeded behind a stuck old pod (e.g. the prior pod was Init:ImagePullBackOff). The stuck in-flight sync op blocks the new revision. Recovery: argocd app terminate-op <app> --grpc-web (login argocd-hub.tail286401.ts.net with argocd-initial-admin-secret). Terminate alone is usually sufficient — once the stuck op is killed, autoSync/selfHeal applies the current desired revision within ~1 min. argocd app sync --force typically keeps returning another operation is already in progress while the terminate winds down — don't fight it; wait for selfHeal. (runbooks/recover-stuck-promotion.md has the full procedure.)
Benchmark-run Dapr-lifecycle recovery lever: POST http://<bff>/api/internal/benchmarks/runs/<runId>/cleanup header x-internal-token: $INTERNAL_API_TOKEN body {} runs the documented terminal-cleanup teardown (its cascade is now generalized into — and shared with — the vetted Lifecycle Controller, src/lib/server/lifecycle/). DB-cancel alone does NOT terminate the durable Dapr session workflows (they keep re-spawning openshell sandboxes); the session-termination path only fires when the run is cancelled, so set runs+instances status='cancelled' first, then call cleanup (expect retries; coordinator+DB must be up). As a passive backstop, the lifecycle-terminal-reaper CronJob reconciles stuck rows on a timer, but it deliberately skips while a benchmark run/lease is active, so for a live run you still drive cleanup explicitly. See the evaluations skill's "System State Update (2026-05-19)".
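The cleanup call can be sketched with the Python standard library. The endpoint path, header name, and empty JSON body come from the text above; the function name `build_cleanup_request` and the `Content-Type` header are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_cleanup_request(bff_base: str, run_id: str, internal_token: str) -> urllib.request.Request:
    """Build (but do not send) the terminal-cleanup POST for one benchmark run."""
    url = (
        f"{bff_base.rstrip('/')}/api/internal/benchmarks/runs/"
        f"{urllib.parse.quote(run_id, safe='')}/cleanup"
    )
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # body is an empty JSON object
        method="POST",
        headers={
            "x-internal-token": internal_token,
            "Content-Type": "application/json",  # assumed; not stated in the text
        },
    )
```

Send it with `urllib.request.urlopen(req)` only after the run and instance rows are already `cancelled`, and wrap the call in a retry loop since the endpoint is expected to need retries.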

Dev SWE-bench concurrency envelope

Practical limits for /workspaces/<slug>/benchmarks are layered. The current intended dev GitOps values are:

| Layer | Knobs | Intended dev value |
| --- | --- | --- |
| Launch/BFF default | `BENCHMARK_DEFAULT_CONCURRENCY` | `10` |
| Capacity mode | `BENCHMARK_CAPACITY_MODE` | `auto` |
| Execution backend/class | `BENCHMARK_EXECUTION_BACKEND` / `BENCHMARK_EXECUTION_CLASS` | `dapr-kueue` / `benchmark-fast` |
| Full-instance Kueue model | `BENCHMARK_KUEUE_INSTANCE_REQUEST_MODE` | `host-worker-composite` |
| Lease resources | `BENCHMARK_KUEUE_LEASE_RESOURCES` | `openshell_sandbox,dapr_workflow_slot` |
| Shared coding pool | `AGENT_RUNTIME_POOL_APP_IDS_JSON` | `agent-runtime-pool-coding`, `maxReplicas=16`, `slotsPerReplica=12` on dev |
| Dedicated coding fallback | `AGENT_RUNTIME_SLOTS_PER_REPLICA_JSON` | `coding=12` |
| Per-sidecar Dapr workflow cap | `AGENT_RUNTIME_DAPR_WORKFLOW_LIMIT_PER_SIDECAR` | `12` |
| Coordinator start pacing | `SWEBENCH_COORDINATOR_INSTANCE_START_BATCH_SIZE` / `...DELAY_SECONDS` | unset or `0` / `0` for full effective-concurrency fan-out |
| Evaluator parallelism | `SWEBENCH_EVAL_MAX_PARALLEL` | `24` |

Per-instance peak draw during inference is modeled as the full sandbox/worker + agent-host bundle. The capacity snapshot fields kueueInstanceRequest*, kueueInstancePodCount, kueueAvailableInstanceSlots, and schedulableKueueInstanceCapacity are the deployment-time truth; do not infer safe concurrency from the launch slider or sandbox-only capacity. Per-run harness evaluation adds Kueue-admitted evaluator TaskRuns. Before raising a run, verify live node headroom, Kueue quota, Dapr runtime readiness, model/provider rate limits, and exact-ready image coverage.

Current clean dev checkpoint: run W4ZmHxaEMEYQDCZ_Ypo41 completed 25 distinct exact-ready SWE-bench_Verified instances with DeepSeek V4 Pro at maxTurns=25. It requested/effectively ran inference 25/25; evaluator requested/effective was 24/9 because Kueue clamped eval capacity. Result was 13 resolved / 7 unresolved / 5 empty-patch, zero evaluator errors, zero hard errors, zero active leases after cleanup, and no Dapr activity-registration failures. Treat 25 as proven; do not jump above it without a clean launch gate and exact-ready preview. Ryzen runs the same composite capacity model but has much less request headroom. The 2026-05-27 ryzen canary MPIlRkKWC7UdvHgwFQEiR selected 3 exact-ready instances and was correctly capped to effective concurrency 2 by kueue_capacity; all three instances inferred/evaluated and active leases returned to zero. Keep ryzen benchmark campaigns sequential even when a single run can safely use multiple effective slots.
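The requested/effective pairs in these checkpoints follow a simple clamp. A deliberately simplified Python sketch, assuming effective concurrency is the smaller of the requested value and the Kueue-admitted capacity; the function name is hypothetical and the real capacity snapshot has more inputs than this.

```python
def effective_concurrency(requested: int, kueue_capacity: int) -> int:
    """Clamp the requested concurrency to admitted capacity, never below zero."""
    return max(0, min(requested, kueue_capacity))


# Capacity inputs below are assumed values consistent with the recorded outcomes.
dev_inference = effective_concurrency(25, 25)   # dev inference ran 25/25
dev_evaluator = effective_concurrency(24, 9)    # dev evaluator requested 24, ran 9
ryzen_canary = effective_concurrency(3, 2)      # ryzen canary capped to 2
```

The clamp explains why raising the launch value alone does nothing once Kueue capacity is the binding limit.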

What to read next

| If the task is… | Read |
| --- | --- |
| New to the system / orienting | `reference/architecture.md` |
| Need kubectl / argocd on a cluster | `reference/access-paths.md` |
| Deciding whether an app belongs on hub or a spoke | `reference/app-placement.md` |
| Anything secret-related (rotation, audit, debugging) | `reference/secret-flow.md` |
| Bumping an image to dev/staging | `runbooks/promote-image-to-spokes.md` |
| Post-push workflow-builder rollout verification on ryzen/dev | `runbooks/track-promotion-state.md` |
| Prompt Workbench or prompt preset DB/API changes after rollout | `runbooks/track-promotion-state.md` + workflow-builder `references/prompt-workbench.md` |
| SWE-bench evaluator promotion or Benchmarks page canary validation | `shared-skills/evaluations/SKILL.md` + `runbooks/track-promotion-state.md` |
| Agent is silent after adding MCP/OAuth connection; ActivePieces piece MCP catalog, Knative KServices, per-session MCP bootstrap, or Dapr statestore scope is suspect | `runbooks/debug-workflow-builder-mcp-auth.md` |
| workflow-builder pod shows `1/2`, `daprd` readiness is false, or `openshell-agent-runtime` / `swebench-coordinator` is unavailable | `runbooks/debug-dapr-sidecar-stale-readiness.md` |
| workflow-builder works on ryzen but not dev/staging | `runbooks/reconcile-workflow-builder-spoke-environment.md` |
| Moving a ryzen-validated image to dev/staging | `runbooks/promote-image-to-spokes.md` |
| Image missing on ghcr.io (outer-loop build didn't run) | `runbooks/debug-funnel-orphan-tag.md` |
| `Validate Workflow Builder Release Pins` CI failing with `denied` on every image | `runbooks/grant-stacks-ghcr-package-access.md` |
| Bumping runtime images such as browser-use-agent-sandbox, dapr-agent-py-sandbox, or claude-agent-py-sandbox (per-session env-var images outside release-pins) | `runbooks/bump-image-pin-not-in-release-pins.md` |
| Editing a workflow JSON spec (maxTurns, prompt, agentKwargs, …) and rolling the change to dev/staging | `runbooks/upsert-workflow-json.md` |
| Upgrade GitOps Promoter or repair its ArgoCD UI extension | `runbooks/manage-gitops-promoter.md` |
| Review all OutOfSync/Degraded apps and decide keep vs remove | `runbooks/review-argocd-app-health.md` |
| ArgoCD operationState stuck Running | `runbooks/recover-stuck-promotion.md` |
| db-migrate Job stuck Terminating | `runbooks/recover-stuck-job-finalizer.md` |
| Webhook not firing / hub Tekton path broken (NXDOMAIN or 202-no-PipelineRun) | `runbooks/debug-funnel-orphan-tag.md` |
| Device-backed Tailscale Ingress missing address, using `-1`, or blocked by stale service/device records | `runbooks/debug-device-backed-tailscale-ingress.md` |
| ProxyGroup service-host missing address or cert domain | `runbooks/debug-proxygroup-service-host.md` |
| Migration shipped but columns missing on dev | `runbooks/fix-drizzle-migration.md` |
| Track a promotion in flight / what's gating it | `runbooks/track-promotion-state.md` |
| Spoke kubectl when Tailscale down | `runbooks/access-spoke-cluster-fallback.md` |
| Rotate a per-spoke OAuth client secret | `runbooks/rotate-oauth-secret.md` |

The runbooks each follow the same shape: Symptoms → Diagnostic → Fix steps → Verify.

CLIs the agent should assume are available

| Tool | Typical use here |
| --- | --- |
| `kubectl` | Multi-context via `~/.kube/config`; for hub use `--kubeconfig ~/.kube/hub-config` (no SSH wrapper when on ryzen) |
| `argocd` | Login via Tailscale: `argocd login argocd-hub.tail286401.ts.net --grpc-web` (admin password in `argocd-initial-admin-secret`). Use for `terminate-op`, `app sync --force`, things kubectl-patch can't do |
| `gh` | GitHub API (webhook delivery history, OAuth app metadata, PR/run inspection); already authenticated as `vpittamp` |
| `op` | 1Password CLI, the hub secret root since 2026-06: `op read 'op://hub-eso/<item>/<field>'` reads hub secrets; `op://CLI/<id>/credential` holds the `hub-eso-reader` SA token. Replaces `az` for hub secrets |
| `az` | Azure KeyVault (`keyvault-thcmfmoo5oeow`); DORMANT for the hub (Azure KV + AD App + OIDC/JWKS not deleted but no longer in the hub recreate path); still relevant for spoke OAuth/cert material. `az keyvault secret show --query attributes.updated -o tsv` for rotation-time audits |
| `skopeo` | Legacy: mirror images from ryzen's local Gitea PVC to ghcr.io (only needed for unrecovered pre-A6 artifacts). Use `--dest-authfile` with hub's `ghcr-push-credentials` secret. Run from ryzen (DNS) |
| `talosctl` | Hub: `--talosconfig ~/.talos/hub-config`; Talos cluster (Hetzner): `~/.talos/talos-config`. Spokes don't have ready-made talosconfig; use kubeconfig fallback |
| `hcloud` | Active context `stacks` (`hcloud context list`). `hcloud server list` for full Hetzner topology |
| `tailscale` | `status --json` for orphan-tag diagnosis; `serve status` / `funnel status` from inside operator pods |
| `git` | Push app/source repos and promoted stacks changes to `origin` unless a task is explicitly about historical Gitea recovery |

Repo paths cheat-sheet

| Path | Role |
| --- | --- |
| `packages/components/hub-spoke-appsets/release-pins/workflow-builder-images.yaml` | dev/staging release metadata source; edit with the generated overlays |
| `packages/components/workloads/workflow-builder-system-overlays/{dev,staging}/kustomization.yaml` | generated dry-source overlays consumed by source-hydrator; contains release-pin images and per-spoke runtime env |
| `packages/components/hub-spoke-appsets/apps/spoke-workloads-appset.yaml` | The cluster-selecting ApplicationSet that points source-hydrator at each generated workflow-builder-system overlay |
| `packages/base/manifests/tailscale-ingresses/` | Shared `-CLUSTER`-placeholder tailnet exposures for promoted-spoke app hostnames: device-backed Tailscale Ingresses (e.g. `phoenix-`) plus the workflow-builder L4 LoadBalancer `Service-workflow-builder-tailnet.yaml` (PR #2319) |
| `packages/components/workloads/<image>/manifests/kustomization.yaml` | Per-image workloads kustomization; current ryzen delivery uses ghcr.io refs reconciled by ryzen's local ArgoCD (`root-ryzen`) from `main`. Exception: for workflow-builder + workflow-mcp-server the bare `images:` block was deleted (C1); the ryzen pin lives in the rendered Component below |
| `packages/components/workloads/workflow-builder-ryzen-image/kustomization.yaml` | Render-generated kustomize Component carrying the ryzen workflow-builder + workflow-mcp-server image pin (newName/newTag); `components:`-included by `workflow-builder/manifests/kustomization.yaml`. `commit-pin.sh` RENDERS + COMMITS it LOCALLY (wfb PR #37); stacks CI `.github/workflows/render-ryzen-image.yml` is a drift-net that re-renders on a diff; do NOT hand-edit |
| `packages/components/hub-spoke-appsets/release-pins/workflow-builder-images-ryzen.yaml` | Flat ryzen pins file (images/imageRefs/digests/sourceShas) that `commit-pin.sh` upserts for workflow-builder + workflow-mcp-server; commit-pin renders the Component above from it locally in the same push (the CI render-net re-derives it too). The render consumes ONLY the workflow-builder + workflow-mcp-server rows; other services' rows here are inert |
| `.github/workflows/render-ryzen-image.yml` | stacks CI drift-net: re-renders `workflow-builder-ryzen-image` from the flat ryzen pins file on each touching push (runs `render-workflow-builder-release-overlays.sh` with `WFB_RENDER_ENVS=ryzen`), commits only on a diff; NO-OPs when commit-pin's local render already matches |
| `packages/components/workloads/workflow-builder/manifests/Deployment-workflow-builder.yaml` | `AGENT_RUNTIME_*_DEFAULT_IMAGE` env vars that the BFF reads per-session to select the agent-sandbox Sandbox pod image (separate from release-pins; see gotchas) |
| `packages/components/workloads/activepieces-mcps/manifests/` | All-catalog reconciler that turns workflow-builder DB `mcp_connection` rows (+ pinned pieces) into cluster-local per-piece `ap-<piece>-service` piece-runtime KServices (one `piece-mcp-server` image / `PIECE_NAME`, serving `/execute` + `/mcp` + `/options` + `/health`; needs `NODE_OPTIONS=--max-old-space-size=400` + `512Mi`) and `activepieces-mcp-catalog` |
| `packages/base/manifests/knative-serving/kustomization.yaml` | Knative Serving install and autoscaler config, including `allow-zero-initial-scale` needed by generated piece-runtime services |
`packages/components/workl
