name: gitops
description: "Use this skill for PittampalliOrg/stacks GitOps operations: ArgoCD app health/drift review across hub/dev/staging/ryzen; argocd-agent spoke registration/status aggregation; image promotion; release-pins, GHCR image drift, and image pins outside release-pins; SWE-bench evaluator image/env rollout and canary validation; workflow-builder spoke runtime drift; workflow-builder prompt-preset DB migrations and API smoke tests; workflow-builder MCP/auth and ActivePieces piece-runtime (MCP + activity execution) services; Dapr agent-runtime statestore, runtime-registry-driven runtime image/env pins (dapr-agent-py / claude-agent-py / browser-use-agent), sidecar readiness, and 1/2 pod recovery; GitOps Promoter stuck apps, env branches, source-hydrator, and hub promotion; Tailscale ACLs, device-backed Ingress DNS/status, ProxyGroup service-host VIPs, spoke API access, stale tailnet devices/services, and Funnel webhooks; OAuth/secret rotation, deployment inventory, workflow JSON DB upserts, and app placement."
GitOps for PittampalliOrg/stacks
Operational knowledge for the hub-and-spoke gitops system across dev, staging, ryzen (local Talos Docker, autonomous argocd-agent spoke), and hub (Talos control plane). Read this whole file, then drill into reference/ or runbooks/ based on the decision tree.
Orientation
- Hub is a Talos cluster on Hetzner. It runs a single ArgoCD that manages itself and all spokes via cluster secrets.
- Control plane = argocd-agent v0.8.1. The hub runs the PRINCIPAL (single pane, ns argocd); each spoke runs a LOCAL ArgoCD + an agent dialing the principal OUTBOUND over tailnet mTLS (8443). dev = MANAGED agent (hub authors Application objects in ns dev, the principal pushes them to the dev agent; observe via kubectl -n dev get applications on the hub). ryzen = AUTONOMOUS agent (reconciles its own apps; hub aggregates status). Sync OPERATIONS run on the SPOKE's local controller, so the hub pane shows sync+health but NOT operation lifecycle; "Unknown operation status" on the hub is architectural/benign. For migrated spokes the cluster-<spoke> Secret is now an AGENT MAPPING (server=...?agentName=<x> + embedded mTLS certData/keyData, NO bearerToken). See reference/architecture.md ("Spoke registration"), cluster-desired-state for the end-to-end recreate path, and cluster-desired-state/references/tailscale-and-certs.md for cert-avoidance detail.
- Spokes: dev, staging (script-provisioned Talos on Hetzner; Crossplane removed in Phase D), and ryzen (imperatively-bootstrapped local Talos-in-Docker on the user's workstation). All three run a LOCAL ArgoCD + an argocd-agent dialing the hub principal. dev/staging = MANAGED (hub authors their Applications and pushes them down). ryzen = AUTONOMOUS: it runs a LOCAL ArgoCD whose root-ryzen app-of-apps reconciles packages/overlays/ryzen@main DIRECTLY (the ryzen-* Applications live on ryzen's own cluster, NOT on the hub; the agent push-mirrors their status up to hub ns ryzen). Ryzen has no local Gitea and no idpbuilder; GitHub + GHCR only.
- Hub Tekton owns the build plane:
  - Outer-loop lane (github-outer-loop EL): auto-builds + promotes ALL Skaffold-owned services to dev on merge to main, NOT just workflow-builder. The hub Tekton EventListener github-outer-loop has per-service triggers (CEL filter: a commit touching services/<svc>/**, or a [build all] commit message). A merge that touches services/<svc>/ fires THAT service's trigger → the parameterized, service-agnostic outer-loop-build Pipeline builds the image → ghcr.io → the update-stacks task pins the SHARED release-pins/workflow-builder-images.yaml (one file holds EVERY service's pin) + regenerates the dev overlay packages/components/workloads/workflow-builder-system-overlays/dev/kustomization.yaml. VERIFIED end-to-end 2026-06-05 for workflow-builder, workflow-orchestrator, function-router, mcp-gateway, swebench-coordinator (the prior "only workflow-builder/swebench-inference auto-builds" belief was WRONG; the per-service triggers had simply never been exercised, so they looked dead). workflow-builder's own trigger fires on src/, lib/, scripts/, static/, drizzle/, Dockerfile, package.json changes. The renderer default is dev-only (staging dormant; stacks PR #2437 flipped the WFB_RENDER_ENVS default from "dev staging" to "dev"; re-enable with WFB_RENDER_ENVS="dev staging"). github-outer-loop deliberately does NOT touch ryzen's pin (ryzen is the Skaffold commit-pin lane). Current pipelines may push the release metadata commit directly to origin/main; older/alternate pipelines may open a release/workflow-builder-* release-intent PR. Inspect update-stacks logs and branch/PR state before assuming the handoff. A release metadata + generated overlay commit on origin/main drives dev. update-stacks push retry now has backoff (stacks #2455). The task's git push origin main retry is 6 attempts at 4/8/12/16/20s with a rebase between (was 3 tries in ~1s with NO backoff, which DROPPED a build's promotion on a transient GitHub 500 / push contention, e.g. a build racing a concurrent merge). Transient push failures now self-heal; this was the only real gap in the per-service auto-build.
    - Bring a STALE service current without a source change. Because the per-service trigger only fires on a services/<svc>/ change, a service can stay frozen at its last successful build indefinitely. To re-pin from current main HEAD, create a PipelineRun from the outer-loop-build Pipeline with params git_url=https://github.com/PittampalliOrg/workflow-builder.git, git_sha=<current main HEAD>, image_name=<svc>, dockerfile=services/<svc>/Dockerfile, context=. (Node: function-router/mcp-gateway) or services/<svc> (Python: workflow-orchestrator/swebench-coordinator), + workspaces shared-workspace (emptyDir), dockerconfig (secret ghcr-push-credentials), buildah-cache (PVC buildah-cache-<svc>). Per-service image_name/dockerfile/context come from the outer-loop-<svc> TriggerBinding. It builds from current main → update-stacks re-pins dev. (Used 2026-06-05 to bring mcp-gateway/swebench-coordinator/function-router current.)
  - Ryzen local delivery: source changes still land through the normal app repo / GitHub path, and current image delivery should use GHCR tags or explicit stacks pins. Ryzen manifest updates are delivered by committing to main: ryzen's LOCAL ArgoCD (root-ryzen@main) re-renders packages/overlays/ryzen directly (no hub Source Hydrator, no Promoter, no env/spokes-ryzen on the ryzen lane).
  - SWE-bench inference image lane: submitted by workflow-builder/swebench-coordinator preflight as swe-env-<envSpecHash-prefix> PipelineRuns on hub Tekton. These validate and pin repo/version/base images into SWEBENCH_INFERENCE_ENVIRONMENTS_DIR ConfigMap data. If a dev SWE-bench run sits queued while a swe-env-* PipelineRun exists on hub, the run is waiting for hub environment validation; do not look for Buildah pods on dev. The supported lane is the organic harness-generated image path; stale Epoch/prebuilt experiment rows or PipelineRuns must not satisfy exact-ready selection unless a fresh compatibility canary proves that strategy.
  - A workflow-builder app push that touches services/dapr-agent-py normally fires three runtime image builds: dapr-agent-py-image-build, dapr-agent-py-sandbox-image-build, and dapr-agent-py-testing-sandbox-image-build. It can also fire the workflow-builder image build from the same commit. Watch the GitHub/GHCR outer-loop PipelineRuns and release metadata before deciding a rollout is complete.
  - A workflow-builder app push that touches services/claude-agent-py is the Claude Agent SDK peer-runtime path. Verify the built ghcr.io/pittampalliorg/claude-agent-py-sandbox:git-<sha> tag, the workflow-builder Deployment env AGENT_RUNTIME_CLAUDE_DEFAULT_IMAGE, and the live BFF pod env before declaring Claude runtime rollout complete.
  - For SWE-bench infra work, a dapr-agent-py change is not live until dev runs the matching sandbox/testing sandbox images and the BFF sees the updated AGENT_RUNTIME_*_DEFAULT_IMAGE values. A recent scoped-activity fix validated this path with workflow-builder commit 0180f081 and a clean 25-instance dev run; use the pattern, not the SHA, as the rule.
  - Old spoke-local build apps such as workflow-builder-builds-local and gitea-builds-egress should stay removed. If you see them live, treat them as stale/orphaned unless a new design explicitly reintroduces them.
- GitOps Promoter gates hub and spoke promotions. No staging cluster currently; promotion is ryzen (direct main) + dev only. workflow-builder-release gates promotion to dev through argocd-health plus the timer gate from TimedCommitStatus-workflow-builder-soak.yaml (dev 0s), autoMerge: true. The env/spokes-staging environment + its 10m soak were dropped (stacks PR #2436); staging is dormant: the cluster-staging Secret, env/*staging* branches, the staging overlay, and the spoke-workloads-appset staging entry are kept in place for fast re-enable (re-add the env to PromotionStrategy-workflow-builder-release.yaml + TimedCommitStatus-workflow-builder-soak.yaml to restore). stacks-environments gates hub self-management from env/hub-next → env/hub.
- ArgoCD v3.4 + Promoter v0.30 (May 2026). Bumped from 3.3.9/v0.27.1. v3.4 has stricter ServerSideApply that surfaces operator-injected drift previously hidden, most commonly seen on Tekton Pipelines/Tasks (mutating webhook adds computeResources: {}, metadata: {}, etc.) and Knative Services (terminationGracePeriodSeconds requires a feature flag). Use ignoreDifferences with jqPathExpressions covering the operator-injected paths; see runbooks/review-argocd-app-health.md. Also enabled the web-based terminal (exec.enabled=true); Pods now have a Terminal tab in the UI.
- ArgoCD Promoter UI extension is installed on hub ArgoCD so operators can visualize PromotionStrategy, ChangeTransferPolicy, PullRequest, and related Promoter CRDs in the ArgoCD UI.
- Source-hydrator renders packages/overlays/<spoke> → env/spokes-<spoke>-next; promoter merges to env/spokes-<spoke>; the hub principal pushes the generated spoke Applications down to the managed agents. This applies to dev/staging only; ryzen is NOT on the source-hydrator/Promoter path (it reconciles packages/overlays/ryzen@main directly via its own local ArgoCD; there is no env/spokes-ryzen).
- Deployment inventory is generated on the hub by gitops-deployment-inventory. Browser/human access uses the HTTPS service-host VIP gitops-inventory-hub.tail286401.ts.net; spoke workflow-builder pods use a separate node-backed Tailscale LoadBalancer gitops-inventory-hub-node.tail286401.ts.net:8080 through the in-cluster egress Service gitops-inventory-hub-egress.tailscale.svc.cluster.local:8080. The generator now lists Argo Applications cluster-wide (was ns=argocd only) so dev/staging health populates instead of showing "Unknown" (#2445). The GitOps pipeline image-history carousel needs a GITHUB_TOKEN on the workflow-builder pod (else GitHub's 60/hr unauthenticated limit empties loadPinHistory), wired via the workflow-builder-secrets ExternalSecret reading KV ryzen-shared-secrets/GITHUB-PAT (#2444).
- Promoted spoke hostnames are declarative. Dev/staging workflow-builder system URLs live in spoke-workloads-appset.yaml, tailnet exposures live under packages/base/manifests/tailscale-ingresses/ (device-backed Ingresses like phoenix-* plus the workflow-builder L4 LoadBalancer Service-workflow-builder-tailnet.yaml), and policy.hujson is reserved for tailnet policy such as real svc:* service-host approvals, device tags, Funnel grants, and Kubernetes grants.
- OpenShell depends on upstream kubernetes-sigs/agent-sandbox CRDs. Keep agent-sandbox-crds/<spoke>-agent-sandbox-crds; it owns required Sandbox, SandboxClaim, SandboxTemplate, and SandboxWarmPool CRDs. The custom AgentRuntime CRD + the Kopf agent-runtime-controller are RETIRED; runtimes are now per-session agent-sandbox Sandbox pods (Kueue-admitted) selected by image, with browser-use-agent on a SandboxWarmPool carve-out. AutoKube is legacy and has been removed.
- Workflow-builder MCP/auth is a DB-backed runtime path, not just static manifests. Project MCP rows in workflow-builder's mcp_connection table bind ActivePieces pieces to app_connection.external_id credentials. The activepieces-mcps app is an all-catalog reconciler (every 2 min) that provisions per-piece Knative ap-<piece>-service services from enabled mcp_connection rows + pinned pieces, plus an activepieces-mcp-catalog ConfigMap. Each ap-<piece>-service is the converged piece-runtime: ONE piece-mcp-server image parameterized by PIECE_NAME, serving POST /execute (deterministic workflow activities; fn-activepieces was deleted), POST /mcp (agent MCP tools), POST /options (canvas dropdowns), and GET /health. Credential reference-forwarding applies to BOTH /mcp tools and /execute activities (callers forward X-Connection-External-Id; the piece-runtime self-resolves via the BFF decrypt). At launch the BFF resolves the agent's MCP servers from its agentConfig.mcpServers and passes them into the per-session runtime as DAPR_AGENT_PY_BOOTSTRAP_MCP_SERVERS_JSON (claude-agent-py now consumes MCP too); there is no per-agent CR or controller injecting them.
- Sandboxed Dapr agents use centralized Dapr state. workflowstatestore is the namespace-wide workflow/actor state store for parent workflows, child session workflows, timers, reminders, and activity bookkeeping. dapr-agent-py-statestore is namespace-wide too, but actorStateStore=false; it is the agent application state API store. Do not create per-agent state stores or move durable history into pod-local state.
- SWE-bench evaluator rollout is a workflow-builder image-promotion path plus an env-var check. The evaluator image is built from the workflow-builder repo (services/swebench-evaluator) and promoted through release-pins, but swebench-coordinator launches Jobs from SWEBENCH_EVALUATOR_IMAGE. The base kustomize has a replacement from local-config Pod/swebench-evaluator-image into that env var; verify the live Deployment env after promotion instead of assuming images: alone rewrote it.
- Claude runtime rollout is workflow-builder image-promotion plus a BFF env-var check. claude-agent-py-sandbox is surfaced in the workflow-builder GitOps pipeline UI service matrix and is consumed through AGENT_RUNTIME_CLAUDE_DEFAULT_IMAGE, not by the workspace sandboxTemplate name. Also verify CLAUDE_AGENT_PY_DEFAULT_MODEL=claude-opus-4-8 and the workflow-builder model key anthropic/claude-opus-4-8 when changing defaults.
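The update-stacks push retry described in the outer-loop lane bullet (six attempts, waits of 4/8/12/16/20s, a rebase between attempts) can be sketched as below. This is a minimal Python model, not the Tekton task's actual script: `push` and `rebase` are hypothetical callables standing in for `git push origin main` and the rebase step, and doing the wait before the rebase is an assumption.

```python
import time
from typing import Callable, Sequence

# Waits between attempts in seconds. Six attempts means five waits (4/8/12/16/20s per the text).
BACKOFF_SECONDS: Sequence[int] = (4, 8, 12, 16, 20)


def push_with_backoff(
    push: Callable[[], bool],
    rebase: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    waits: Sequence[int] = BACKOFF_SECONDS,
) -> int:
    """Call push() up to len(waits) + 1 times and return the attempt number that succeeded.

    push and rebase are hypothetical stand-ins for the real git commands.
    Raises RuntimeError when every attempt fails.
    """
    attempts = len(waits) + 1
    for attempt in range(1, attempts + 1):
        if push():
            return attempt
        if attempt < attempts:
            sleep(waits[attempt - 1])  # back off before touching the remote again
            rebase()                   # pick up whatever landed on main in the meantime
    raise RuntimeError(f"push failed after {attempts} attempts")
```

Injecting `sleep` keeps the schedule testable without real waiting; a transient failure on the first two attempts costs 12 seconds of waiting under this schedule instead of dropping the promotion.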
The two image-pin systems for the same workflow-builder base are the most common source of confusion. Read reference/architecture.md first if you've never seen this setup.
Event-driven activity stream (Argo Events → workflow-builder GitOps pipeline UI)
The whole delivery system is observable LIVE in workflow-builder at /admin/gitops/system (the "Kargo lens" pipeline view), fed by Argo Events on the hub. This is the fastest way to watch a build/promotion/sync land across ryzen + dev in real time.
- Producers (stacks packages/components/hub-management/manifests/gitops-activity-events/, ArgoCD app apps/gitops-activity-events.yaml): an Argo Events EventBus + three resource EventSources watch hub CRs: ArgoCD Applications (EventSource-gitops-argocd), Tekton PipelineRun/TaskRuns (-tekton), GitOps-Promoter CRs (-promoter). Matching Sensors (Sensor-gitops-{argocd,tekton,promoter} + -inventory-refresh) fire on each change and HTTP-POST a normalized event to workflow-builder's internal ingest endpoint. The argo-events-ui manifests expose the raw Argo Events dashboard.
- Ingest (workflow-builder BFF): POST /api/internal/gitops/events/ingest → src/lib/server/gitops/activity-events.ts classifies source (tekton/promoter/argocd), normalizes resourceRef, extracts a correlation map (imageName, imageRef, gitSha, argocdApp, syncRevision/syncStatus/healthStatus, branch, hydratedSha, drySha, PR, commitStatusKey…), computes a deterministic eventId, and appends to the gitops_activity_events table with a monotonic sequence.
- Consumer (the UI): /admin/gitops/system opens an SSE stream GET /api/v1/gitops/events/stream?since=<seq> (sequence-resume + exponential-backoff reconnect + fallback poll only while disconnected). src/lib/gitops/activity-overlay.ts CORRELATES events onto the pipeline model: Tekton by imageName/imageRef/gitSha, Promoter → the release-pins bundle stage by branch/hydratedSha, ArgoCD by <env>-<app> name. The overlay attaches activity ONLY; it never mutates the authoritative inventory (health/sync) snapshot. Graph nodes / list / per-node drawer render event-first (failed/passing/active/neutral tone from one shared activity-tone.ts, freshness from a shared clock, a brief node + edge "flow" pulse on each incoming batch); the raw firehose is behind ?debug=1.
- Reading it during a deploy: the header build <sha> badge shows the image THAT cluster's UI pod is itself running, so it doubles as the per-cluster delivery proof: build <ryzen-sha> on workflow-builder-ryzen.tail286401.ts.net, the promoted sha on the dev URL. Inventory/health remains the authoritative recovery snapshot; the event stream is the live overlay on top of it.
- Build feedback + delivery timeline (INVENTORY-sourced, NOT event-sourced; 2026-06-05). The pipeline view also surfaces image-build status and the whole Commit→Build→Pin→Promote→Deploy chain, but it reads these from the hub inventory, not the Argo-Events stream. Why: the activity stream is ~100% ArgoCD (Tekton events are buried out of any practical window), whereas gitops-deployment-inventory already aggregates environments[].applications[].build = {pipelineRun, status, reason, startedAt, finishedAt} (+ desired/promotion/live/drift) per app, refreshed ~15s. So the model threads build (→ a build chip on each stage card: Built/Building/Failed + duration + a Tekton PipelineRun deep-link to ns tekton-pipelines) and imageHistory provenance (commit/pin); no Tekton TaskRun triggers were added; the Tekton EventSource is unchanged. The node drawer's Delivery timeline shows inter-step gaps (↓ +1m), phase durations (build, soak), a commit→live lead-time header, and a single "live since" on Deploy; a per-row "N mins ago" would collapse to one value because the automated outer-loop runs as one sub-minute burst (durations/gaps carry the signal instead). Lane-aware Promote: dev shows the promoted hydrated sha + soak/gate; ryzen shows "direct to main · no Promoter gate" (the pin IS the promotion). Data quirk to know: imageHistory.committedAt is actually the pin-commit time (from the stacks release-pins git log), so Commit≈Pin and the lead-time anchors on build.startedAt (the genuinely-earliest event). Lesson for any fast automated pipeline: show durations/gaps/lead-time, not repeated absolute timestamps.
- App-wide deployment notifications (toast + sidebar bell; 2026-06-05). Beyond the pipeline page, workflow-builder now notifies on EVERY authenticated page when an image actually REPLACES a deployment (a component's live image tag changes on a cluster, the "your change is live" signal): a svelte-sonner toast + a sidebar notification bell (unread badge + localStorage history). It's admin-gated (the inventory/SSE endpoints require platform-admin) and INVENTORY-DIFF-driven (same philosophy as the build feedback above): a singleton store (src/lib/stores/deployment-notifications.svelte.ts, started once from the root layout onMount) baselines each env:component's SET of live image tags from /api/v1/gitops/deployment-metadata and fires when a genuinely NEW tag appears while Synced. The gitops SSE stream is only a debounced re-check trigger; a 25s poll is the fallback. Detection gotcha (verify if you touch it): the inventory's live.images mid-rollout holds BOTH the old AND new component tags (old+new ReplicaSet pods coexist) and desired.image is a full ref WITH a tag, not a bare repo, so use a tag-SET diff (current − baseline), not a single "current tag", or the old tag wins and nothing fires. See the [[project_app_wide_deploy_notifications]] memory. To test a notification without a full build, commit-pin a service on ryzen to an existing GHCR tag (e.g. SKAFFOLD_IMAGE=ghcr.io/pittampalliorg/<svc>:git-<sha> bash skaffold/hooks/commit-pin.sh <svc>); the live-image change fires it.
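The tag-SET diff from the detection gotcha above (current − baseline, gated on Synced) can be illustrated with a small model. The real store is TypeScript/Svelte; this Python sketch only mirrors the logic as described, and `tag_of`, `DeploymentWatcher`, `observe`, and the example tags are hypothetical names, not identifiers from the repo. Replacing the baseline with the current set after each Synced observation is also an assumption.

```python
from typing import Dict, Iterable, Set


def tag_of(image_ref: str) -> str:
    """Return the tag of a full image ref, or '' when the ref carries no tag."""
    ref = image_ref.split("@", 1)[0]   # drop any @sha256:... digest suffix
    name = ref.rsplit("/", 1)[-1]      # last path segment, so a registry port is not read as a tag
    return name.rsplit(":", 1)[1] if ":" in name else ""


class DeploymentWatcher:
    """Hypothetical stand-in for the notification store: one baseline tag set per env:component."""

    def __init__(self) -> None:
        self._baseline: Dict[str, Set[str]] = {}

    def observe(self, key: str, live_images: Iterable[str], synced: bool) -> Set[str]:
        """Return tags that are new relative to the baseline; an empty set means do not notify."""
        current = {tag_of(ref) for ref in live_images} - {""}
        if key not in self._baseline:
            self._baseline[key] = current  # first observation only establishes the baseline
            return set()
        if not synced:
            return set()                   # keep the old baseline so the tag still fires once Synced
        fresh = current - self._baseline[key]  # set difference: the old tag cannot mask the new one
        self._baseline[key] = current
        return fresh
```

With a mid-rollout snapshot holding both `git-aaa` and `git-bbb`, a single "current tag" lookup could return `git-aaa` and miss the change, while the set difference against a baseline of `{git-aaa}` yields `{git-bbb}`.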
The "which file?" matrix (single most-referenced piece of knowledge)
| Cluster | Image source | Bump path | Branch the bump lands on |
|---|---|---|---|
| ryzen | packages/components/workloads/<image>/manifests/kustomization.yaml (images: block) or release-pinned GHCR refs for shared workload images | Edit stacks, commit/merge to main (or run commit-pin.sh, which now pushes to main). Ryzen's LOCAL ArgoCD (root-ryzen @ main) re-renders packages/overlays/ryzen directly; hard-refresh the affected ryzen-* apps (or deployment/scripts/ryzen-sync.sh) to skip the poll interval | GitHub PittampalliOrg/stacks main (no inner-loop, no env/spokes-ryzen; both retired) |
| dev / staging | packages/components/hub-spoke-appsets/release-pins/workflow-builder-images.yaml (images compatibility tags plus digests, imageRefs, sourceShas, pipelineRuns, updatedAts) rendered into dry-source overlays at packages/components/workloads/workflow-builder-system-overlays/{dev,staging}/kustomization.yaml | Hub Tekton outer-loop update-stacks writes release metadata and regenerates overlays; observed current path can push directly to origin/main, while PR-mode opens release/workflow-builder-*. Manual changes must update/validate the same metadata and overlays | origin/main release metadata commit, or release/workflow-builder-* PR branch → origin/main when PR mode is active |
| hub itself | source-hydrator from packages/overlays/hub on origin/main → env/hub-next → env/hub (gated by stacks-environments PromotionStrategy) | Edit overlay; merge to origin/main | origin/main (GitHub) |
The static dapr-agent-py pool Deployment (backing the openshell-durable-agent enum + the agent-runtime-pool-coding benchmark pool) is a third path: bumped via its workloads image pin (no per-cluster override for the manifest), so a single bump applies to all spokes once it's on origin/main. (The old custom agent-runtime-controller Kopf operator that materialized per-agent Deployments is retired — there is no Deployment-agent-runtime-controller.yaml to bump.)
The ryzen row's bare-images: mechanism above holds for most Skaffold-owned services (workflow-orchestrator, function-router, mcp-gateway — commit-pin edits their newTag directly), but workflow-builder + workflow-mcp-server are an exception since C1 (2026-06-04): their bare images: block was deleted; ryzen's pin now lives in the render-generated Component packages/components/workloads/workflow-builder-ryzen-image/kustomization.yaml, which their workloads kustomization components:-includes. commit-pin for those two writes the FLAT file packages/components/hub-spoke-appsets/release-pins/workflow-builder-images-ryzen.yaml AND renders + commits the Component LOCALLY in the same push (wfb PR #37); stacks CI (.github/workflows/render-ryzen-image.yml) is just a drift-correction safety net that re-renders only on a diff. See the "Workflow-builder image pin has two visible truths" gotcha for the full flow.
release-pins/workflow-builder-images.yaml is the image-pin source for promoted dev/staging workflow-builder-system child Applications, but it is not applied directly by the ApplicationSet. It is rendered into the dry-source overlays with scripts/gitops/render-workflow-builder-release-overlays.sh, and source-hydrator reads those overlays. Manual release-pin edits must run the renderer or scripts/gitops/validate-workflow-builder-release-pins.sh will fail the overlay freshness check.
- Do not add release-pin lookups back into spoke-workloads-appset.yaml. Argo CD source-hydrator caches by dry-source commit; when rendered output depends on ApplicationSet generator values outside the dry source, a controller race can hydrate the right dry SHA with stale inline values and then keep reusing that hydrated commit. Keep release-pin-derived images/env in the generated dry-source overlays instead.
- Runtime images (browser-use-agent-sandbox, dapr-agent-py-sandbox, claude-agent-py-sandbox) are read from AGENT_RUNTIME_*_DEFAULT_IMAGE env vars on the workflow-builder Deployment (packages/components/workloads/workflow-builder/manifests/Deployment-workflow-builder.yaml) or launch-specific runtime config; the runtime registry SSOT (services/shared/runtime-registry.json) maps each runtime to its imageEnvKey. kustomize.images substitutes container image: fields but not env var values, so release-pins bumps don't touch these. Bump the env var and verify the live BFF pod sees it; the BFF reads it per session when it spawns the agent-sandbox Sandbox pod, so the next session uses the new image (no per-agent AgentRuntime CR to patch; the CRD + controller are retired). For the static dapr-agent-py pool, roll its Deployment. See runbooks/bump-image-pin-not-in-release-pins.md.
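Why a release-pins bump leaves the runtime images untouched can be seen in a toy model of an images: override. The sketch below is not kustomize itself: `apply_image_pins` is a hypothetical helper that rewrites only container image fields, which is the behavior the bullet above describes, and the `git-old`/`git-new` tags are placeholders.

```python
import copy
from typing import Any, Dict


def apply_image_pins(deployment: Dict[str, Any], new_tags: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy with container image tags rewritten; env values are deliberately left alone."""
    rendered = copy.deepcopy(deployment)
    for container in rendered["spec"]["template"]["spec"]["containers"]:
        name, sep, _old_tag = container["image"].rpartition(":")
        if sep and name in new_tags:
            container["image"] = f"{name}:{new_tags[name]}"
    return rendered


# Assumed minimal Deployment shape: one container plus one runtime-image env var.
deployment = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "workflow-builder",
        "image": "ghcr.io/pittampalliorg/workflow-builder:git-old",
        "env": [{
            "name": "AGENT_RUNTIME_CLAUDE_DEFAULT_IMAGE",
            "value": "ghcr.io/pittampalliorg/claude-agent-py-sandbox:git-old",
        }],
    }]}}}
}

pins = {
    "ghcr.io/pittampalliorg/workflow-builder": "git-new",
    "ghcr.io/pittampalliorg/claude-agent-py-sandbox": "git-new",
}

rendered = apply_image_pins(deployment, pins)
container = rendered["spec"]["template"]["spec"]["containers"][0]
print(container["image"])            # the container image picks up the new tag
print(container["env"][0]["value"])  # the env var still carries the old runtime image
```

Even though the pin map names the sandbox image, only the container image changes; the env var needs its own bump and a check against the live BFF pod.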
Ryzen is an AUTONOMOUS argocd-agent spoke with its OWN local ArgoCD; no local Gitea, no idpbuilder (GitHub + GHCR only). Ryzen-affecting manifest changes flow through GitHub main:
- Ryzen-only image-tag bumps (Skaffold outer-loop): commit-pin.sh pushes to GitHub main. Ryzen's LOCAL ArgoCD (root-ryzen@main) re-renders and applies the bump: NO source-hydrator, NO Promoter, NO env/spokes-ryzen for this path (all retired for ryzen).
- Manifest changes affecting hub itself (cluster Secrets, ApplicationSet definitions, headlamp Service annotations): commit to GitHub main. Hub Source Hydrator hydrates packages/overlays/hub to env/hub-next. GitOps Promoter then creates env/hub-next → env/hub PRs that MUST be merged for the change to take effect.
- Manifest changes affecting dev/staging workload-layer: commit to main. Source-hydrator renders packages/overlays/<spoke> to env/spokes-<spoke>-next; Promoter gates the env/spokes-dev-next → env/spokes-dev step; the principal then pushes the Applications to the managed dev/staging agents. (Ryzen is NOT on this path; it reconciles overlays/ryzen@main itself.)
Hub→ryzen kube-api reach (the ryzen Tailscale operator's apiserver-proxy SNI path, or the ryzen host raw-TCP tailscale serve --tcp=6443 passthrough) is RETIRED as the ArgoCD sync path — under argocd-agent ryzen reconciles its own apps locally and the agent dials the hub principal OUTBOUND (8443). The hub→ryzen kube endpoint now exists ONLY for Headlamp (the host-passthrough endpoint in the dedicated headlamp.dev/cluster=true Secret). The cluster-ryzen Secret is an AGENT MAPPING (server=https://argocd-agent-resource-proxy:9090?agentName=ryzen + embedded mTLS, NO bearerToken), not a kube-API endpoint. See reference/architecture.md ("Control plane: argocd-agent v0.8.1" and "Spoke registration"), cluster-desired-state for the recreate path, and the ryzen-spoke-bootstrap skill's references/failure-modes.md.
For hot-loop regression checks, use deployment/scripts/benchmark-ryzen-hot-edit.sh with BENCHMARK_PURPOSE=normal|manual|threshold-test and BENCHMARK_CASE=child-service|app-definition|dependency-file. The app-definition case uses a source-only child Application marker so it exercises root/app-definition planning without leaving live Application fields behind. The summary command defaults to --purpose normal and excludes failed threshold-test reports; use --purpose all --include-failures when auditing full history.
MCP/auth has a third, non-image flow. mcp_connection and app_connection rows live in the workflow-builder DB; activepieces-mcps reconciles those rows into Knative services; at session launch the BFF resolves the agent's MCP servers and passes them into the per-session runtime pod as DAPR_AGENT_PY_BOOTSTRAP_MCP_SERVERS_JSON (read per-session — no per-agent CR caches it). A source push alone does not fix an agent whose stored MCP config is stale; fix the mcp_connection/agentConfig.mcpServers rows, then launch a fresh session and verify the runtime pod env and logs show the expected servers.
Decision tree
"I need to roll out / promote / bump an image"
- Which cluster? ryzen only → update the workloads manifest or GHCR image pin in stacks, then commit/merge to main (or run commit-pin.sh, which pushes to main); ryzen's local ArgoCD picks it up. If the image does not exist yet, push the app repo to origin/main so the normal GitHub/GHCR outer-loop builds it first.
- dev or staging → normal path is hub Tekton outer-loop builds GHCR and update-stacks writes tag, digest, provenance, and generated dry-source overlays. Read the task logs: if it pushed directly to origin/main, track source-hydrator + Promoter from that commit; if it opened a release/workflow-builder-* PR, review/merge it first. Manual path: edit all release metadata maps in release-pins/workflow-builder-images.yaml, run scripts/gitops/render-workflow-builder-release-overlays.sh, verify the GHCR tag/digest, run scripts/gitops/validate-workflow-builder-release-pins.sh, then follow runbooks/promote-image-to-spokes.md.
- Want dev/staging to use an image you validated on ryzen → use the GHCR tag/digest as the promoted artifact, then bump release-pins and generated overlays. Legacy Gitea registry mirroring is recovery-only; it is not the normal source of promoted images.
"I pushed workflow-builder and need to verify ryzen + dev"
- Confirm the app repo commit is on origin/main so the hub Tekton outer-loop builds the new ghcr.io/pittampalliorg/workflow-builder:git-<sha> image.
- Ryzen: for fast iteration, commit-pin.sh already pushed the new tag to GitHub main; ryzen's LOCAL ArgoCD picks it up automatically. Verify with kubectl --context admin@ryzen get application ryzen-workflow-builder -n argocd (should be Synced/Healthy) and confirm the live Deployment image on ryzen via kubectl --context admin@ryzen get deploy workflow-builder -n workflow-builder -o jsonpath='{.spec.template.spec.containers[0].image}'. During active skaffold dev sessions the Skaffold-owned dev pod may serve live traffic from synced source while the local ArgoCD app is paused (skip-reconcile); inspect the live pod before assuming the image rollout is what users hit.
- Dev: watch hub outer-loop-workflow-builder-*; capture the built GHCR tag/digest and read update-stacks logs. The task may push release metadata directly to origin/main or open a release PR depending on the active pipeline.
- Track spoke-dev-workflow-builder.status.sourceHydrator.currentOperation.{drySHA,hydratedSHA}, the workflow-builder-release-env-spokes-dev-* ChangeTransferPolicy, and dev-workflow-builder/spoke-dev-workflow-builder health. If env/spokes-dev-next advanced but the CTP still proposes the older dry SHA after one source-hydrator poll, annotate PromotionStrategy/workflow-builder-release and the dev CTP with fresh promoter.argoproj.io/refresh-ts.
- Finish with authenticated smoke tests against the public URLs. For schema-affecting workflow-builder changes, verify the db-migrate hook applied the expected migration before trusting the UI. For Prompt Workbench/preset changes, confirm resource_prompt_versions exists and run an authenticated /api/prompt-presets list/create/update/archive smoke. On NixOS, if Playwright's bundled browser cannot launch, use system Chrome at /etc/profiles/per-user/vpittamp/bin/google-chrome.
"I edited stacks manifests and need ryzen to pick them up"
- Decide the right path:
  - Ryzen-only image-tag bump (typical after skaffold run): commit/merge to main via commit-pin.sh. Ryzen's LOCAL ArgoCD (root-ryzen@main) reconciles it directly: no inner-loop, no env/spokes-ryzen, no Promoter.
  - Hub-affecting change (cluster Secret, ApplicationSet, Tailscale Service annotation): commit to main, then merge the env/hub-next → env/hub Promoter PR (gh pr list -R PittampalliOrg/stacks --state open --search 'Promote').
  - Spoke-workloads change: commit to main. Ryzen reconciles overlays/ryzen@main directly; dev/staging pick it up via source-hydrator + their Promoter step.
- Trigger immediate refresh on ryzen instead of waiting for the poll interval: kubectl --context admin@ryzen -n argocd annotate application ryzen-<svc> argocd.argoproj.io/refresh=hard --overwrite (or run deployment/scripts/ryzen-sync.sh, which hard-refreshes root-ryzen).
- If a sync is stuck on "another operation is already in progress", run argocd app terminate-op <app>, then retry with --replace.
- For child Application spec changes that don't propagate, the env/hub Promoter ladder might be stuck: see runbooks/manage-gitops-promoter.md. The fast path is gh pr create --base env/hub --head env/hub-next + merge if Promoter hasn't auto-created the PR.
"I updated SWE-bench evaluator/coordinator and need to deploy/test"
- Confirm the workflow-builder commit includes the intended `services/swebench-evaluator/Dockerfile` pin or coordinator changes, and is pushed to `origin/main` so the GHCR outer-loop can build it. For ryzen, update the stacks workloads pin on `main` and let ryzen's local ArgoCD pick it up.
- Watch hub Tekton build `swebench-evaluator:git-<workflow-builder-sha>` and capture the GHCR digest from `release-pins/workflow-builder-images.yaml` / generated workflow-builder-system overlay.
- Track `dev-swebench-coordinator` to `Synced/Healthy`, then verify the live Deployment has `SWEBENCH_EVALUATOR_IMAGE=ghcr.io/pittampalliorg/swebench-evaluator:git-<sha>` with the expected digest-backed release.
- Run a focused SWE-bench canary: one known gold patch that resolves, one empty-patch case that returns `empty_patch`, and, when available, an environment/build validation case. Evaluator Jobs should use the expected resource class and `ttlSecondsAfterFinished=3600`.
- For UI-visible validation, create or trigger a Benchmarks page run (`/workspaces/<slug>/benchmarks`) and confirm artifact SHA-256s, provenance, official result, raw harness notes, report path, and job name appear in the run API/UI. The evaluations skill has the DB/coordinator smoke path for deterministic runs.
For DeepSeek SWE-bench validation, use the direct DeepSeek model specs (deepseek/deepseek-v4-pro, deepseek/deepseek-v4-flash) and confirm the selected agent runtime reports provider deepseek with llm-deepseek-v4-* components. Effective concurrency is the minimum of UI/requested concurrency, runtime slots, per-sidecar Dapr workflow capacity, global benchmark caps, sandbox headroom, and model caps; when in doubt read the evaluations skill references/swebench-concurrency.md before changing stacks values.
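The "minimum of all caps" rule above can be sketched as a small helper. This is an illustrative sketch, not code from the evaluations skill: the function and parameter names are hypothetical labels for the six limits listed, and the real values come from the UI, the runtime registry, and stacks configuration.

```python
def effective_concurrency(
    requested: int,
    runtime_slots: int,
    sidecar_workflow_capacity: int,
    global_benchmark_cap: int,
    sandbox_headroom: int,
    model_cap: int,
) -> int:
    """Return the smallest of the six limits, clamped at zero."""
    limits = (
        requested,
        runtime_slots,
        sidecar_workflow_capacity,
        global_benchmark_cap,
        sandbox_headroom,
        model_cap,
    )
    return max(0, min(limits))


def binding_limits(**limits: int) -> list[str]:
    """Name the limit(s) that actually cap the run (labels are hypothetical)."""
    floor = min(limits.values())
    return sorted(name for name, value in limits.items() if value == floor)


if __name__ == "__main__":
    print(effective_concurrency(16, 8, 4, 32, 6, 10))
    print(binding_limits(requested=16, sidecar=4, model=4))
```

Knowing which limit binds is usually more useful than the number itself: raising the requested concurrency does nothing when per-sidecar capacity is the floor.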
"A workflow-builder agent is silent after adding an MCP/OAuth connection"
Use runbooks/debug-workflow-builder-mcp-auth.md. The short path:
- Confirm `workflow-builder`, `activepieces-mcps`, and `knative-serving` are `Synced/Healthy` on the target cluster.
- Confirm the piece appears in `activepieces-mcp-catalog` and its `ap-<piece>-service` KService is Ready. The URL should be `http://ap-<piece>-service.workflow-builder.svc.cluster.local/mcp` with no explicit `:3100`.
- Confirm the `mcp_connection.connection_external_id` points at an active `app_connection.external_id`; MCP credentials flow through `X-Connection-External-Id`, not inline secrets in manifests.
- Fix the agent's `mcp_connection`/`agentConfig.mcpServers` rows, then launch a fresh session and check the per-session runtime pod env for `DAPR_AGENT_PY_BOOTSTRAP_MCP_SERVERS_JSON` (resolved per-session by the BFF; there is no generated `agent-runtime-<slug>` Deployment anymore).
- Wake/test the agent and read `dapr-agent-py` logs for `[mcp-bootstrap] connected ...` and tool registration. A first health probe may time out during Knative cold start; retry before declaring the KService broken.
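The URL rule in the checklist above (cluster-local host, `/mcp` path, no explicit port) can be checked mechanically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the piece name `slack` in the usage line is only an example, not a claim about which pieces are deployed.

```python
from urllib.parse import urlparse

NAMESPACE = "workflow-builder"  # namespace taken from the URL pattern above


def piece_mcp_url(piece: str) -> str:
    """Build the expected MCP URL for a piece (no explicit port)."""
    return f"http://ap-{piece}-service.{NAMESPACE}.svc.cluster.local/mcp"


def check_piece_mcp_url(url: str) -> list[str]:
    """Return a list of problems; an empty list means the URL looks right."""
    problems = []
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if parsed.scheme != "http":
        problems.append(f"scheme is {parsed.scheme!r}, expected 'http'")
    if parsed.port is not None:
        problems.append(f"explicit port :{parsed.port} present, expected none")
    suffix = f"-service.{NAMESPACE}.svc.cluster.local"
    if not (host.startswith("ap-") and host.endswith(suffix)):
        problems.append(f"host {host!r} does not match ap-<piece>{suffix}")
    if parsed.path != "/mcp":
        problems.append(f"path is {parsed.path!r}, expected '/mcp'")
    return problems


if __name__ == "__main__":
    bad = "http://ap-slack-service.workflow-builder.svc.cluster.local:3100/mcp"
    print(check_piece_mcp_url(bad))
```

Running the check over the URLs stored in the agent's MCP rows is a quick way to spot a leftover `:3100` before launching a fresh session.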
"A workflow-builder runtime pod is 1/2 or daprd readiness is false"
Use runbooks/debug-dapr-sidecar-stale-readiness.md. First identify whether the app container or daprd is not ready. If the app container is ready but daprd reports ERR_HEALTH_NOT_READY for grpc-api-server / grpc-internal-server, check recent dapr-system placement/scheduler churn, verify the control plane is healthy now, then recycle only the affected workflow-builder Deployment. Do not clear state stores or restart Dapr control-plane components unless they are still unhealthy.
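The first step above, deciding whether the app container or `daprd` is the one that is not ready, can be read straight from the pod's `status.containerStatuses`. This is a hedged sketch that assumes the input is the parsed output of `kubectl get pod <name> -o json` and that the sidecar container is named `daprd`; the function names and the returned hint strings are hypothetical.

```python
import json


def not_ready_containers(pod: dict) -> list[str]:
    """Names of containers whose readiness is false in a Pod object."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return [s["name"] for s in statuses if not s.get("ready", False)]


def triage(pod: dict) -> str:
    """Map the not-ready set to a coarse next step (wording is illustrative)."""
    bad = not_ready_containers(pod)
    if not bad:
        return "all containers ready"
    if bad == ["daprd"]:
        return "sidecar only: check dapr-system churn, then recycle this Deployment"
    if "daprd" in bad:
        return "app and sidecar: debug the app container first"
    return "app only: not a stale-sidecar case"


if __name__ == "__main__":
    sample = json.loads(
        '{"status": {"containerStatuses": ['
        '{"name": "app", "ready": true}, {"name": "daprd", "ready": false}]}}'
    )
    print(triage(sample))
```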
"An ArgoCD app is OutOfSync / stuck"
- Query hub ArgoCD even for dev/staging: `kubectl --kubeconfig ~/.kube/hub-config get applications.argoproj.io -n argocd`.
- Check `kubectl get app <name> -n argocd -o jsonpath='{.status.operationState.phase}'`.
- Phase=Running for hours? Check `.status.operationState.message`; it is usually `waiting for completion of hook batch/Job/db-migrate`. Drill: `kubectl get jobs -n workflow-builder` on the spoke; if `db-migrate` is stuck Terminating, see `runbooks/recover-stuck-job-finalizer.md`.
- Controller log shows "Skipping auto-sync: failed previous sync attempt"? ArgoCD won't retry the same revision; see `runbooks/recover-stuck-promotion.md` (terminate-op + force sync via argocd CLI on Tailscale).
- Job Pod is `Init:ImagePullBackOff` with "not found"? The image isn't on ghcr.io yet because the outer-loop build didn't produce that tag; rebuild from the source commit (see `runbooks/debug-funnel-orphan-tag.md`).
- Need a fleet review or decide whether legacy apps should be removed? Use `runbooks/review-argocd-app-health.md` before applying fixes. It covers keep/remove decisions, stale status cache, ExternalSecret/Tekton default drift, Tailscale egress mutation, and hub promotion.
"Review all degraded/out-of-sync apps and remove legacy resources"
Use runbooks/review-argocd-app-health.md. The short rule: identify whether each resource is still part of the current system before fixing drift. Known outcomes:
- Keep `agent-sandbox-crds`/`<spoke>-agent-sandbox-crds`; OpenShell and agent-runtime controllers require those CRDs.
- Remove AutoKube references; AutoKube is legacy in this repo.
- The old hcloud-spoke Crossplane `AzureWorkloadIdentity` claim/provider path is legacy; hcloud spoke lifecycle now uses hcloud/talos/kubernetes/terraform providers and existing Azure Workload Identity configuration.
- For needed apps, prefer making desired manifests match API-controller defaults over broad `ignoreDifferences`; use ignores only for intentional operator mutation like Tailscale egress Services.
"GitHub webhook didn't fire / image build doesn't reach ghcr.io"
Triage by `gh api .../hooks/<id>/deliveries` `status_code` first; there are TWO common failure modes on the same path:
- `status_code: 0` + `dig @1.1.1.1 tekton-hub.tail286401.ts.net` NXDOMAIN → Tailscale Funnel orphan-tag on `ts-tekton-github-triggers` proxy. The `policy.hujson` lost a tag the device still uses; control plane drops the funnel cap. See `runbooks/debug-funnel-orphan-tag.md` (Funnel orphan tag section).
- `status_code: 202` (accepted) but no PipelineRun on hub → EL processing failure. `el-github-outer-loop` logs show `Post "": unsupported protocol scheme ""` at `sink/sink.go:413` for the matching `/triggers-eventid`. Same runbook, "EL processing failure" section. Workaround: skopeo-mirror to ghcr.io + bump release-pins manually until the EL is fixed.
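The two failure modes above reduce to a small decision table over three observations. The sketch below is illustrative only: the function name and return labels are hypothetical, and the three inputs are assumed to have been gathered by hand (delivery `status_code` from the GitHub API, whether the Funnel hostname resolves publicly, and whether a PipelineRun appeared on the hub).

```python
def classify_delivery(
    status_code: int,
    dns_resolves: bool,
    pipeline_run_created: bool,
) -> str:
    """Classify a webhook delivery into the failure modes described above."""
    if status_code == 0 and not dns_resolves:
        # Delivery never connected and the Funnel name is NXDOMAIN.
        return "funnel-orphan-tag"
    if status_code == 202 and not pipeline_run_created:
        # EventListener accepted the event but produced no PipelineRun.
        return "el-processing-failure"
    if status_code == 202 and pipeline_run_created:
        return "ok"
    return "unclassified"


if __name__ == "__main__":
    print(classify_delivery(0, dns_resolves=False, pipeline_run_created=False))
    print(classify_delivery(202, dns_resolves=True, pipeline_run_created=False))
```

Anything that lands in `unclassified` falls outside the two known modes and needs the runbook rather than this shortcut.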
"I edited a workflow JSON spec — when does it deploy?"
Workflow JSONs at services/<agent>/<name>.workflow.json in the workflow-builder repo are not baked into the workflow-builder image (the production Dockerfile copies src/ and drizzle/ only — services/ is excluded). Spec changes (new prompt, agentKwargs, maxTurns, etc.) require a manual DB upsert against the spoke's postgres. Either run node scripts/<workflow>.mjs --user-email ... from a pod with DATABASE_URL set, or directly UPDATE workflows SET spec = $jsonFromFile.spec, nodes = ..., edges = ... WHERE id = '<workflow-id>'. Image rebuilds alone won't roll the change. See runbooks/upsert-workflow-json.md.
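For the direct-UPDATE route, a parameterized statement is safer than pasting JSON into SQL by hand. The sketch below only builds the statement and parameters; it does not connect to a database. Assumptions, all hypothetical: the helper name, psycopg-style `%s` placeholders, and that each column accepts a JSON string. The source elides where `nodes` and `edges` come from, so the usage example sets `spec` only.

```python
import json

# Columns named in the UPDATE statement above.
ALLOWED_COLUMNS = ("spec", "nodes", "edges")


def build_workflow_update(workflow_id: str, columns: dict) -> tuple[str, list]:
    """Return (sql, params) for a parameterized UPDATE on the workflows table."""
    unknown = set(columns) - set(ALLOWED_COLUMNS)
    if unknown:
        raise ValueError(f"unexpected columns: {sorted(unknown)}")
    if not columns:
        raise ValueError("nothing to update")
    names = [name for name in ALLOWED_COLUMNS if name in columns]
    assignments = ", ".join(f"{name} = %s" for name in names)
    sql = f"UPDATE workflows SET {assignments} WHERE id = %s"
    params = [json.dumps(columns[name]) for name in names] + [workflow_id]
    return sql, params


if __name__ == "__main__":
    # Stand-in for the parsed contents of a <name>.workflow.json file.
    doc = {"spec": {"maxTurns": 8}}
    sql, params = build_workflow_update("<workflow-id>", {"spec": doc["spec"]})
    print(sql)
    print(params)
```

The runbook and the `scripts/<workflow>.mjs` path remain the reference procedure; this only shows the shape of the statement.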
"I shipped a migration but the new columns aren't on dev/staging"
Almost always: the SQL file in drizzle/ is missing from drizzle/meta/_journal.json. npx drizzle-kit migrate (the db-migrate Sync hook) silently skips files without journal entries — Job exits 0 but nothing gets applied. See runbooks/fix-drizzle-migration.md. (BFF will then 500 on every query that includes the new column.)
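A pre-merge check for this failure is easy to script. The sketch below lists SQL files in `drizzle/` that have no matching entry in `drizzle/meta/_journal.json`. It assumes the usual drizzle-kit journal layout, where each item under `entries` carries a `tag` equal to the migration file name without `.sql`; the function name is hypothetical.

```python
import json
from pathlib import Path


def unjournaled_migrations(drizzle_dir: str) -> list[str]:
    """Return migration names (file stems) missing from the journal."""
    root = Path(drizzle_dir)
    journal = json.loads((root / "meta" / "_journal.json").read_text())
    journaled = {entry["tag"] for entry in journal.get("entries", [])}
    return sorted(
        path.stem for path in root.glob("*.sql") if path.stem not in journaled
    )


if __name__ == "__main__":
    missing = unjournaled_migrations("drizzle")
    if missing:
        print("missing journal entries:", ", ".join(missing))
        raise SystemExit(1)
    print("journal covers every migration file")
```

Because the Job exits 0 even when a file is skipped, a non-zero exit from a check like this in CI is the earlier signal.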
"I want to track a promotion in flight"
Start with workflow-builder's admin deployment inventory when available; it shows desired image, live images, drift, build, and promotion metadata in one place. The hub ArgoCD UI now has a GitOps Promoter extension for visualizing Promoter CRDs, and PromotionStrategy + ChangeTransferPolicy + spoke ArgoCD apps remain the authoritative lower layers. See runbooks/track-promotion-state.md for both views and a CLI cheat-sheet. Most "stuck" reports are actually normal ~3 min source-hydrator poll cycles.
"workflow-builder works on ryzen but is broken on dev/staging"
Treat this as environment drift, not a live-patch task. Check the promoted spoke runtime env, tailnet exposures (the workflow-builder L4 LoadBalancer Service + tls-terminator sidecar, and any device-backed Ingresses like phoenix-*), ACL policy, spoke API VIP grants, and stale hub hydration. Typical declarative fixes are in spoke-workloads-appset.yaml, packages/base/manifests/tailscale-ingresses/, and policy.hujson. See runbooks/reconcile-workflow-builder-spoke-environment.md.
"I need to upgrade GitOps Promoter or fix the Promoter UI"
Use runbooks/manage-gitops-promoter.md. The current deployment pattern is: keep the latest published Helm chart unless a newer chart exists, override manager.image.tag when the app release is newer than the chart appVersion, and manage the ArgoCD UI extension through argocd-gitops-promoter-ui plus bootstrap deployment/config/argocd-values.yaml. Do not hand-patch long-term state without committing it to stacks.
"Which image / commit is live on dev or staging?"
Use the workflow-builder admin Deployments view or the hub inventory endpoint first. It is backed by gitops-deployment-inventory on the hub and is the fastest way to compare release-pins, Argo live images, promotion SHAs, and outer-loop build status. See runbooks/track-promotion-state.md.
"workflow-builder Deployments shows fetch failed"
First distinguish UI auth from inventory transport. From inside the workflow-builder pod, WORKFLOW_BUILDER_GITOPS_INVENTORY_URL should be http://gitops-inventory-hub-egress.tailscale.svc.cluster.local:8080/inventory.json. The egress Service in tailscale should target tailscale.com/tailnet-fqdn: gitops-inventory-hub-node.tail286401.ts.net, port 8080. Do not target gitops-inventory-hub.tail286401.ts.net from the egress Service; that is a Tailscale service-host VIP, not a tailnet node. See runbooks/track-promotion-state.md.
"A promoted-spoke Tailscale Ingress has no address, a -1 suffix, or stale DNS"
First check whether the Ingress has tailscale.com/proxy-group. Promoted-spoke app URLs such as phoenix-* are normally device-backed Tailscale Ingresses, not svc:* service-hosts. Debug stale tailnet devices, stale Tailscale Services, and operator-managed Secret metadata with runbooks/debug-device-backed-tailscale-ingress.md. (workflow-builder-* is no longer an Ingress — it is an L4 LoadBalancer Service + tls-terminator sidecar since PR #2319; mcp-gateway is in-cluster only. See reference/access-paths.md.)
"A ProxyGroup service-host VIP has no address or TLS/cert is broken"
If the resource is a ProxyGroup-hosted service such as argocd-hub, nocodb-hub, or gitops-inventory-hub, debug service-host tags, not Funnel. Check the Ingress tailscale.com/tags, policy.hujson autoApprovers.services, the Tailscale Service tags, and the proxy pod Self.Tags / CapMap["service-host"]. See runbooks/debug-proxygroup-service-host.md.
"Hub ArgoCD can't reach a spoke / spoke-<name> shows ComparisonError / a new spoke won't register"
- Sync transport (all spokes): under argocd-agent each spoke reconciles its own apps locally; the agent dials the hub principal OUTBOUND (8443) over tailnet mTLS. The hub→spoke kube-API reach (apiserver-proxy SNI, ryzen host-passthrough) is RETIRED for sync. A `spoke-<name>` ComparisonError on the hub principal is usually a stale/misconfigured AGENT MAPPING (`cluster-<spoke>` Secret: `server=...?agentName=<spoke>` + embedded mTLS, NO bearerToken) or an agent that isn't dialing in. Check the agent pod on the spoke and the principal logs, not a kube-API SNI path.
- Registration (enroll, not register-spoke): `register-spoke-with-hub.sh` is RETIRED. dev enrolls via `deployment/scripts/argocd-agent/enroll-dev-agent.sh` (MANAGED agent); ryzen enrolls via `deployment/scripts/argocd-agent/enroll-ryzen-agent.sh` (AUTONOMOUS agent: mints the agent mTLS cert, applies the `packages/components/hub-management/manifests/ryzen-agent-bootstrap` kustomize component including the `root-ryzen` app-of-apps @ `main`, runs `argocd-agentctl agent create ryzen` to write the `cluster-ryzen` AGENT MAPPING, stages the Headlamp Secret, hard-refreshes `root-ryzen`). For the MANAGED spokes (dev/staging) the `spoke-clusters-appset`/`spoke-workloads-appset` still fan cluster Secrets into Applications the principal pushes down; edit the `packages/components/hub-spoke-appsets/` copy, NOT the unused `hub-base` copy. Ryzen is NOT driven by these appsets; it reconciles its own apps via `root-ryzen`.
See reference/architecture.md ("Hub → spoke ArgoCD connectivity" and "Spoke registration").
"I need kubectl on a spoke (dev / staging) and Tailscale isn't working"
See reference/access-paths.md for normal paths and runbooks/access-spoke-cluster-fallback.md for the Crossplane-kubeconfig-secret fallback.
"OAuth / social login broken — client_id and/or client_secret passed are incorrect"
Almost always: KeyVault *-CLIENT-ID-* and *-CLIENT-SECRET-* were rotated at different times (compare attributes.updated). See runbooks/rotate-oauth-secret.md. Watch for the ESO refresh ↔ pod restart race — reference/secret-flow.md. If login works but the GitHub repo picker is missing org repos, it's NOT a secret problem — see the per-cluster OAuth-app org-grant gotcha below.
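The "rotated at different times" check above amounts to comparing two timestamps. A minimal sketch, assuming both `attributes.updated` values are ISO-8601 strings already fetched from Key Vault; the function names and the 10-minute tolerance are hypothetical choices, not values from the runbook.

```python
from datetime import datetime, timedelta


def _parse(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def rotation_skew(client_id_updated: str, client_secret_updated: str) -> timedelta:
    """Absolute gap between the two secrets' last-updated times."""
    return abs(_parse(client_id_updated) - _parse(client_secret_updated))


def rotated_apart(
    client_id_updated: str,
    client_secret_updated: str,
    tolerance: timedelta = timedelta(minutes=10),  # assumed threshold
) -> bool:
    """True when the pair was likely rotated in separate operations."""
    return rotation_skew(client_id_updated, client_secret_updated) > tolerance


if __name__ == "__main__":
    print(rotated_apart("2025-01-01T00:00:00+00:00", "2025-01-03T08:30:00+00:00"))
```

A large skew is a hint, not proof: the pair can still be mismatched after a same-minute rotation if one value was pasted from the wrong OAuth app.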
Critical gotchas (memorize these)
Ryzen is an AUTONOMOUS argocd-agent spoke with its OWN local ArgoCD; no local Gitea. Ryzen reconciles `packages/overlays/ryzen` @ `main` DIRECTLY via `root-ryzen`: GitHub `main` is the source for EVERYTHING ryzen (including image-tag bumps via Skaffold outer-loop + `commit-pin.sh`). There is no `inner-loop` branch (retired), no `env/spokes-ryzen`, no source-hydrator and no Promoter on the ryzen lane. (Hub itself still uses source-hydrator + Promoter: `packages/overlays/hub` → `env/hub-next` → `env/hub`, manual-merge PR.)

The idpbuilder + local-Gitea path is retired. See `ryzen-spoke-bootstrap` skill for the new bootstrap (talosctl + helm + kubectl). Old runbooks referencing `idpbuilder stacks sync` are obsolete.

Recreate automation is script-driven (named entry points). Use these instead of ad-hoc rebuilds:
- dev: `deployment/scripts/talos-hetzner/recreate-dev.sh` is the ORCHESTRATOR. It wraps data backup/restore (environment_image_builds/agents/workflows) + `provision-spoke.sh` + `bootstrap-spoke-deps.sh` + `argocd-agent/enroll-dev-agent.sh` + the verify gate. Dev rebuild entry point.
- ryzen: `deployment/scripts/bootstrap-spoke-cluster.sh --recreate` now ENROLLS the autonomous agent via `deployment/scripts/argocd-agent/enroll-ryzen-agent.sh` + the `packages/components/hub-management/manifests/ryzen-agent-bootstrap` kustomize component (agent-autonomous bundle + params `mode=autonomous` + `cluster-ryzen-local` alias + `stacks-repo-read` + cert ExternalSecrets + `root-ryzen` app-of-apps @ `main`; `argocd-agentctl agent create ryzen` writes the agent mapping; hard-refreshes `root-ryzen`). `register-spoke-with-hub.sh` is RETIRED and NO LONGER CALLED. The `--ts-acl-mode`/`--ts-host-passthrough` flags are VESTIGIAL (parsed for compat, ignored). ryzen reconciles `overlays/ryzen` @ `main` directly (no `inner-loop`, no source-hydrator).
- hub: `deployment/scripts/recreate-hub.sh` modes `--verify-only` / `--seed-secret` / `--fixups` / `--dry-run-clone` / `--in-place --confirm-wipe hub-cluster`. It NEVER hcloud-deletes the 5 ash servers; `--in-place` does a rolling `talosctl reset` reusing `talos-cluster/main/secrets/hub-secrets.yaml` (preserves etcd identity), re-apply, re-bootstrap ONE CP, and bootstraps `onepassword-sa-token` via `op read` (NOT JWKS); `--dry-run-clone` rehearses on a throwaway cluster via `provision-spoke.sh`. PLUS `deployment/scripts/hub-verify-gate.sh` (a 9-check read-only convergence gate) and a self-healing `kube-system-fixups` CronJob (`packages/components/hub-management/manifests/kube-system-fixups/`) that re-applies the Flannel `--iface` + CoreDNS anti-affinity patches Talos does not persist.
- Recreate-hardening hands-off fixes (PR #2395; full detail in `cluster-desired-state/runbooks/recovery-and-gotchas.md`):
  - root-ryzen repo-server cold-start race: on a fresh ryzen recreate the local controller's FIRST `root-ryzen` comparison can race the not-yet-Available `argocd-repo-server` (dial :8081 refused) → `ComparisonError`/sync=Unknown, no child apps, no re-queue for ~5 min. `enroll-ryzen-agent.sh` step 6b now waits `rollout status deploy/argocd-repo-server` then `annotate application root-ryzen argocd.argoproj.io/refresh=hard`; `bootstrap-spoke-cluster.sh` step 10 hard-refreshes again (re-compare vs the latest `main` HEAD). Both non-fatal.
  - Headlamp kubeconfig stale after EVERY spoke recreate (dev + ryzen): hub Headlamp builds its kubeconfig only in its `generate-kubeconfig` init-container, so a pod predating the recreate keeps serving the OLD spoke endpoint/CA/token even after `enroll-{dev,ryzen}-agent.sh` step 5b re-stages the `headlamp-cluster-<spoke>` Secret. Both enroll scripts now `kubectl -n headlamp rollout restart deploy/hub-headlamp deploy/hub-headlamp-embedded` (guarded on deploy existence, non-fatal, off the critical path).
  - `provision-spoke.sh --destroy` deletes Hetzner servers in parallel: the per-server `hc server delete` calls are now concurrent (was sequential ~18s each); ~156s → ~20s for a 9-node dev spoke.
- Validated hands-off: ryzen `bootstrap-spoke-cluster.sh --recreate` = 13m9s (64/65 Synced/Healthy, zero manual intervention); dev `recreate-dev.sh` = 20m32s.

Hub secret root is 1Password, NOT Azure Workload Identity (migrated 2026-06). The hub's 21 ExternalSecrets resolve from the `onepassword-store` ClusterSecretStore (ESO `onepasswordSDK` provider → the dedicated `hub-eso` 1Password vault). Bootstrap root-of-trust = ONE scoped read-only 1Password Service-Account token (`hub-eso-reader`) in Secret `onepassword-sa-token` (ns `external-secrets`), persisted at `op://CLI/<id>/credential` and read at recreate via the operator's developer SA token (`op read`). The old `azure-keyvault-store` CSS + Azure KV (`keyvault-thcmfmoo5oeow`) + the AD App + the Azure OIDC/JWKS federation are DORMANT (not deleted); `sync-jwks-to-azure.sh` is NO LONGER in the hub recreate path (it is a SPOKE-only tool now). Spokes are UNAFFECTED: they still read hub-mirrored secrets via the ESO kubernetes-provider `hub-secrets-store` ClusterSecretStore over Tailscale, regardless of how the hub populates its k8s Secrets. The hub spoke-secrets ExternalSecrets that mirror into `<cluster>-shared-secrets` + `tailnet-ca` now read from `onepassword-store`. See `reference/secret-flow.md`.

Workflow-builder image pin has two visible truths (and ryzen's is now a rendered Component, NOT the bare
`images:` block; C1, 2026-06-04). For workflow-builder + workflow-mcp-server the bare `images:` block was DELETED from `packages/components/workloads/workflow-builder/manifests/kustomization.yaml`; that kustomization now `components:`-includes the render-generated `packages/components/workloads/workflow-builder-ryzen-image/kustomization.yaml`, which carries the workflow-builder + workflow-mcp-server pin (`newName`/`newTag`) and IS ryzen's effective image. `commit-pin.sh` for these two services UPSERTS the flat pins file `packages/components/hub-spoke-appsets/release-pins/workflow-builder-images-ryzen.yaml` (`images`/`imageRefs`/`digests`/`sourceShas`) AND renders + commits the `workflow-builder-ryzen-image` Component LOCALLY in the same push. It runs `WFB_RENDER_ENVS=ryzen scripts/gitops/render-workflow-builder-release-overlays.sh` inside its fresh hard-reset `~/.cache/skaffold/stacks-ryzen` clone; the render is deterministic and byte-identical to CI (wfb PR #37). It does NOT edit the manifests `newTag`. After pushing, it issues `refresh=hard` on the ryzen SPOKE-LOCAL app, so ryzen reconciles in SECONDS (no CI wait, no 30s poll). The stacks CI action `.github/workflows/render-ryzen-image.yml` is UNCHANGED but is now just a DRIFT-CORRECTION SAFETY NET: it re-renders on push and commits only on a diff, so it NO-OPS when commit-pin's local render already matches. (This is workflow-builder/workflow-mcp-server ONLY; every other Skaffold-owned service, such as workflow-orchestrator, function-router, and mcp-gateway, STILL pins via the bare `packages/components/workloads/<svc>/manifests/kustomization.yaml` `newTag`, edited directly by commit-pin.) The `dev-workflow-builder` Application may still show release-pin `spec.source.kustomize.images` overrides to the same tag.

If ryzen reverts while dev stays correct, check the flat ryzen pins file AND the rendered `workflow-builder-ryzen-image` Component (commit-pin should have rendered it; the drift-net CI is a backstop), commit any fix to `origin/main` (ryzen reconciles `main` directly; no `inner-loop` push), and verify both ryzen's live Deployment image (via `--context admin@ryzen`) and hub's `dev-workflow-builder.status.summary.images`.

RESOLVED 2026-06-05: ryzen is now single-pin (Component only). The `packages/overlays/ryzen` app-of-apps had ALSO been patching the `ryzen-workflow-builder` Application's OWN `spec.source.kustomize.images` to a hardcoded sha, which ArgoCD applies ON TOP of the rendered kustomization and which therefore silently WON over the Component (commit-pin updated the Component, the app showed `Synced`, but the Deployment stayed frozen on the stale override sha; an `argocd-repo-server` restart did NOT help because it is override-precedence, not a cache). That override was REMOVED (the app keeps only its non-image `patches:`), so the render-generated `workflow-builder-ryzen-image` Component is the SOLE ryzen authority and `deploy:skaffold`/commit-pin rolls ryzen with NO overlay edit. Telltale of the (now-guarded) trap: the app's `spec.source.kustomize.images` shows an OLD sha while `kubectl kustomize .../workflow-builder/manifests` renders the NEW one. A CI guard `.github/workflows/validate-ryzen-no-app-image-overrides.yml` (+ `scripts/gitops/validate-ryzen-no-app-image-overrides.sh`) now fails the build if any ryzen-overlay Application reintroduces it. (dev legitimately still uses `spec.source.kustomize.images`: that lane has a SINGLE writer, the release-pins renderer; the anti-pattern is a SECOND authority on the SAME app, which is what ryzen had.)

The ryzen Component consumes ONLY the
`workflow-builder` + `workflow-mcp-server` rows of the flat ryzen pins file, and outer-loop merges never write that file. `WFB_RENDER_ENVS=ryzen scripts/gitops/render-workflow-builder-release-overlays.sh` derives `workflow-builder-ryzen-image/kustomization.yaml` from just those two services' rows in `release-pins/workflow-builder-images-ryzen.yaml`; any other service's row there is INERT (their ryzen pins live in the bare `workloads/<svc>/manifests/kustomization.yaml`). The GitHub outer-loop (merge to wfb `main`) builds images and writes the DEV pins file (`workflow-builder-images.yaml`) + dev overlay only; it does NOT touch ryzen pins. To deliver an outer-loop-built commit to ryzen, NEVER rebuild locally (same `git-<sha>` tag, different digest; see the skaffold-dev-loop skill). Either commit-pin the existing GHCR tag (`SKAFFOLD_IMAGE=ghcr.io/pittampalliorg/<svc>:git-<sha> bash skaffold/hooks/commit-pin.sh <svc>` from the wfb repo) or do it manually in stacks: edit the flat ryzen pins file, run `WFB_RENDER_ENVS=ryzen scripts/gitops/render-workflow-builder-release-overlays.sh`, commit/push (on the NixOS HTTPS-403, push with the gh-token URL: `git push "https://x-access-token:$(gh auth token)@github.com/PittampalliOrg/stacks.git" HEAD:main`), then `refresh=hard` the `ryzen-<svc>` app. Note `workflow-mcp-server` is an actually-deployed workload since 2026-06 (was manifest-only): Deployment + `Service-workflow-mcp-server.yaml` (port 3200) added to the workflow-builder kustomization, `DATABASE_URL`/`INTERNAL_API_TOKEN` via `envFrom workflow-builder-secrets`; it hosts the goal MCP tools + workflow tools, so its pin matters now.

GitOps delivery webhook/relay topology (why ryzen needs a spoke-local refresh, verified v0.8.1, 2026-06-04). The 3 GitHub webhooks are all HUB-FACING (Tailscale Funnel): `tekton-hub` (build EL), `argocd-webhook-hub` (`/api/webhook`), `gitops-promoter-webhook-hub`. NONE reaches a spoke directly. The argocd-agent principal relays a hub-side `argocd.argoproj.io/refresh` annotation to MANAGED agents (dev reconciled in ~3s in a live test) but does NOT relay it to AUTONOMOUS agents: ryzen has no argocd-server (no inbound `/api/webhook`) and a hub-side refresh on the ryzen mirror NEVER reached the spoke live (an argocd-agent code-read suggests #447/v0.2.0 should enable autonomous refresh, but it did not reproduce; trust the live behavior). So ryzen's ONLY fast path is the SPOKE-LOCAL `refresh=hard` that commit-pin/`ryzen-sync.sh` issues; otherwise it's the 30s poll. Dev hydration is webhook-accelerated: stacks PR #2449 added `argocd.argoproj.io/manifest-generate-paths` (pointing at each spoke's `workflow-builder-system-overlays/<spoke>` dry-source path) to the `spoke-workloads` hydrator appset, so the hub `argocd-webhook-hub` fires hydration on a release-pin render into that overlay (without it, the hydrator waited ~120s for its poll); that hydrator app (`spoke-dev-workflow-builder`) is HUB-reconciled, so the webhook drives it directly with no agent relay on that hop. Ryzen is unaffected (sourced by `root-ryzen`, not this hub appset).

Outer-loop release handoff can be direct-main or PR-mode. Hub Tekton
`update-stacks` is the source of truth: inspect its logs and Git state to see whether release metadata was pushed directly to `origin/main` or placed on a `release/workflow-builder-*` PR branch. Direct human edits to release pins should be exceptional and must pass `scripts/gitops/validate-workflow-builder-release-pins.sh`.

`stacks-environments` PromotionStrategy has `autoMerge: false`. Unlike `workflow-builder-release` (env/spokes-dev only; staging dormant, PR #2436) which auto-merges after `argocd-health`, the `env/hub` PR (`gitops-promoter-*[bot]: Promote <sha> to env/hub`) requires manual merge. Every change under `packages/overlays/hub` (which includes `spoke-workloads-appset.yaml`, AppSet templates, etc.) opens such a PR and the dev/staging cascade is blocked until it's merged. Easy to miss because `workflow-builder-release` IS auto.

Hub promoter status can lag branch tips. If `env/hub-next` has a newer hydrated SHA but `stacks-environments-env-hub-*` still proposes the prior dry SHA/PR, annotate both `PromotionStrategy/stacks-environments` and the `ChangeTransferPolicy` with a fresh `promoter.argoproj.io/refresh-ts`.

`workflow-builder-release` can lag source-hydrator too. If `env/spokes-dev-next` has advanced but `workflow-builder-release-env-spokes-dev-*` still proposes the prior dry SHA after one poll interval, refresh `PromotionStrategy/workflow-builder-release` plus the dev CTP with `promoter.argoproj.io/refresh-ts`. Do not use hard-sync as a substitute for Promoter catching up.

Concurrent outer-loop commits can leave Promoter one dry SHA behind. A single app push can trigger multiple outer-loop updates, such as workflow-builder and workflow-orchestrator release metadata. If source-hydrator's current dry SHA is newer but the workflow-builder dev CTP keeps proposing the previous dry SHA after a poll, refresh
`PromotionStrategy/workflow-builder-release`, the dev CTP, and hard-refresh `spoke-dev-workflow-builder` before declaring the rollout stuck.

Release-pin validation needs each GHCR package linked to `stacks` via Manage Actions access. The `validate-workflow-builder-release-pins` GitHub Action runs `skopeo inspect` against every image with `${{ github.token }}` (which only has `packages: read`). PittampalliOrg's GHCR container packages are private and built by other repos (workflow-builder, opencode-durable-agent, etc.), so the workflow's token can only read them if each package has `PittampalliOrg/stacks` added under Manage Actions access with Role: Read. Missing link → every image fails identically with `reading manifest <tag> in ghcr.io/pittampalliorg/<image>: denied` (authz, not "tag missing"). Adding a new image to `release-pins/workflow-builder-images.yaml` requires linking its package before merging. See `runbooks/grant-stacks-ghcr-package-access.md`.

Release-pin hydration must be dry-source deterministic. `spoke-workloads-appset.yaml` should select the spoke cluster and point source-hydrator at `packages/components/workloads/workflow-builder-system-overlays/<spoke>`; it should not template `imageRefs`, `sourceShas`, sandbox-image env, or other release-pin values inline. Argo CD source-hydrator is dry-SHA oriented, so if rendered output depends on values outside the dry source a race can produce `env/spokes-<spoke>-next` with stale child Application images even while `.status.sourceHydrator.currentOperation.drySHA` is current. Fix stale release-pin renders by regenerating the overlays and committing a real dry-source change, not by relying on empty commits as the normal path.

Do not delete agent sandbox CRDs as "duplicates." `agent-sandbox-crds` is the CRD owner for the upstream kubernetes-sigs/agent-sandbox resources (`Sandbox`, `SandboxClaim`, `SandboxTemplate`, `SandboxWarmPool`). It is separate from controllers and workload apps by design so CRDs sync early. (The custom `AgentRuntime` CRD it used to also own is retired.)

AutoKube is legacy. If AutoKube Applications, Ingresses, ACL service approvals, or manifests appear, remove them declaratively and let Argo prune them instead of repairing them.

Argo drift review is keep/remove first, fix second. For needed resources, run
argocd app diffand prefer declaring API defaults (ExternalSecret defaults, Tekton EventListener defaults, CRD defaults) over hiding real drift. Emptyargocd app diffwith OutOfSync status can mean stale Argo status (hard-refresh, then restart the application controller if it persists) — BUT for ExternalSecrets it usually does NOT: it's the ESO server-defaulted fields that ArgoCD's client-side diff flags but the CLI/SSA normalize away, so check the UI Diff tab, not the CLI. That class is now muted fleet-wide by a globalargocd-cmresource.customizations.ignoreDifferences.external-secrets.io_ExternalSecret(2026-05-30); ESO isv2.4.1on theexternal-secrets.io/v1API. Don't burn time on controller restarts for ESO empty-diff OutOfSync — seecluster-desired-staterunbook §L andrunbooks/review-argocd-app-health.md.Runtime images flow through BFF env vars, not release-pins. The per-session agent-sandbox Sandbox pod image is selected at launch time by the BFF reading
env.AGENT_RUNTIME_BROWSER_USE_DEFAULT_IMAGE(browser-use-agent),env.AGENT_RUNTIME_DEFAULT_IMAGE(dapr-agent-py default), orenv.AGENT_RUNTIME_CLAUDE_DEFAULT_IMAGE(claude-agent-py) — keyed off the runtime registry'simageEnvKey. These env vars are static literals onDeployment-workflow-builder.yaml. Bumping release-pins does not update them; edit the Deployment YAML and verify the live BFF pod sees the new env — the NEXT session then pulls the new image (no per-agentAgentRuntimeCR to patch; the CRD + controller are retired). The staticdapr-agent-pypool rolls via its own Deployment image. Seerunbooks/bump-image-pin-not-in-release-pins.md.Per-session sandbox image on DEV flows pins → render → the dev overlay's
SANDBOX_EXECUTION_CLASSES_JSONpatch — the base manifest tag is INERT for dev. The agent-host image for per-session Sandbox pods isagentHostImageinside sandbox-execution-api'sSANDBOX_EXECUTION_CLASSES_JSON(a JSON-string env var —kustomize.imagescannot rewrite inside it). On dev, bump the release pins and runscripts/gitops/render-workflow-builder-release-overlays.sh: it regeneratesworkflow-builder-system-overlays/dev/kustomization.yaml, which patches that env var (every execution class'sagentHostImage) onto the sandbox-execution-api Deployment. The hardcoded tag in the baseworkloads/sandbox-execution-api/manifests/Deployment-sandbox-execution-api.yamlis stale-by-design and inert wherever the overlay applies — don't "fix" it expecting a dev rollout. Rollout nuance (verified live): sessions spawned during the rollout window keep their old pod, and a mid-session pod reschedule preserves conversation history (durable state) — no need to kill in-flight sessions to land the image.SWEBENCH_EVALUATOR_IMAGEis an env var, not a container field. Theswebench-coordinatorbase has a kustomize replacement that copies the rewritten image from local-configPod/swebench-evaluator-imageinto the Deployment env var. If a coordinator launches stale evaluator Jobs, verify both the generated overlay and live Deployment env. On ryzen, an activeskaffold devsession (with ArgoCD skip-reconcile) can mask the declarative env the same way it masks workflow-builder BFF env vars; dev is the cleaner rollout target for promoted SWE-bench evaluator validation.Sandbox templates (
workspace_profile.with.sandboxTemplate) resolve via SANDBOX_TEMPLATE_IMAGES_JSON on the workflow-builder Deployment (NOT kustomize.images). For dev/staging, the release-pins renderer stamps this env var into the generated dry-source overlays. The env var is a JSON object mapping template names to image refs. Adding a new template name = (1) add a Dockerfile.<name> under services/openshell-sandbox/environments/ in the workflow-builder repo, (2) commit with subject environment(<name>): so the env-image-build pipeline fires, (3) add the image pin/rendering path in stacks.
Legacy Gitea dev-image commits are not the ryzen hot path. If an old artifact mentions
chore(dev-images): deploy ... to ryzen, treat it as historical build-lane evidence. Update the stacks pin and use the GitHub branch flow (commit-pin or Promoter PR).
Orchestrator
wfstate_state orphan reminders can block new StartInstance. workflowstatestore is state.postgresql v2 with tablePrefix=wfstate_. When a workflow row is purged but its actor reminder is still in dapr-scheduler-server's ETCD, daprd retries the reminder ~every 10s and logs Unable to get data on the instance: <id>, no such instance exists. The retry loop can serialize behind the workflow runtime worker queue and make new ctx.call_child_workflow/StartInstance calls hit DEADLINE_EXCEEDED after 60s. Confirm via daprd logs first. For terminal cleanup, prefer the vetted Lifecycle Controller (BFF src/lib/server/lifecycle/, stopDurableRun with mode:"purge"/"reset") or the BFF benchmark cleanup endpoint; both do scoped terminate/poll/purge per per-session app-id, and the orchestrator's purge_workflow now forwards force (purge-force, Dapr 1.17.9, which cleans the associated reminders). The lifecycle-terminal-reaper CronJob (workflow-builder ns) reconciles stuck rows on a timer (skipped while a benchmark run/lease is active). Manual wfstate_state truncation is incident recovery only, after active runs and leases are zero.
environment(<slug>): commit subject is the only trigger for the env-image-build pipeline. The hub-tekton EventListener (build-environment-image trigger in EventListener-workflow-builder-fn-builds.yaml) filters on body.commits[*].message ~= '^environment\\(.+?\\):' AND a modified services/openshell-sandbox/environments/Dockerfile.<slug> path. Both conditions must hold per push. Slug is extracted via c.message.split('(')[1].split(')')[0]. Commit message typos like env(code-eval): will silently skip the build with no visible error.
ActivePieces piece MCP URLs should not include port
:3100 when targeting Knative. The container listens on 3100, but callers hit the cluster-local Knative Service URL. Stale agentConfig.mcpServers or workflow configs containing http://ap-...svc.cluster.local:3100/mcp bypass Knative routing and can leave agents silent.
MCP auth is request-scoped by connection external ID. For piece MCP tools, the runtime sends
X-Connection-External-Id; piece-mcp-server calls workflow-builder's internal decrypt API. Do not put OAuth tokens, decrypted credentials, or user-specific secrets into KService env, workflow JSON, or GitOps manifests. The reconciler may set a fallback CONNECTION_EXTERNAL_ID, but per-request headers are the correct multi-user path.
ActivePieces piece-runtime services are generated from DB state by an all-catalog reconciler. The
activepieces-mcps reconciler (every 2 min) provisions every catalog piece's ap-<piece>-service from enabled mcp_connection rows + PINNED_PIECES, so new pieces are automatic, with no manual per-piece add. Each ap-<piece>-service is the converged piece-runtime (one piece-mcp-server image, PIECE_NAME env) serving /execute (deterministic activities, replacing the deleted fn-activepieces) + /mcp + /options + /health. The set pinned (or workflow-referenced) is held at minScale=1; the rest scale to zero. If a user adds Outlook/Excel/OneDrive and the KService is missing, debug activepieces-mcp-reconciler before patching workloads by hand.
Piece-runtime KServices scale to zero by design.
knative-serving must set allow-zero-initial-scale: "true", and scale-to-zero services use initialScale: "0". Cold starts can make the first /health, /mcp, or /execute probe exceed a short timeout; retry with a longer timeout before treating it as a hard failure.
The converged piece-runtime image needs
NODE_OPTIONS=--max-old-space-size=400 + a 512Mi memory limit. The piece-mcp-server image is OOM-killed at MODULE LOAD under a 256Mi/384Mi limit (loading all 42 AP piece packages). If ap-<piece>-service pods CrashLoop / OOMKill on startup, verify the generated KService carries NODE_OPTIONS=--max-old-space-size=400 and a 512Mi limit before suspecting the piece code.
Dapr durable protocol compatibility depends on a single actor state store per sidecar. Current workflow-builder expects
workflowstatestore to be the only actorStateStore=true Component visible in the namespace. dapr-agent-py-statestore must stay actorStateStore=false; it stores agent application state, not workflow actor state. If agent sessions hang after a runtime change, verify Component metadata before restarting pods or clearing state.
Dapr workflow cleanup is a lifecycle, not an instant delete, and it's now automated + vetted. Termination requests can return before a workflow is terminal. Every user-facing stop routes through ONE Lifecycle Controller (
src/lib/server/lifecycle/{index,cascade,resolvers,reaper,ownership}.ts, stopDurableRun(target, {mode: interrupt|terminate|purge|reset})) which fans out terminate/purge explicitly per per-session app-id (the native Dapr recursive cascade doesn't cross task hubs; the orchestrator's old terminate_durable_runs_by_parent_execution was retired). It terminates/polls/purges per-session session+turn workflows first, then the parent, then reaps Sandbox CRs and flips DB rows terminal. Request/confirm (not one-shot fail-closed): a stop that can't confirm in-request returns HTTP 202 "stopping" + persists a stop_requested_at intent, and only flips DB / reaps once Dapr is confirmed terminal (via the GET …/stop/status poll + the reaper); 409 only on a genuine failure or coordinator_owned. A durable/run parent wedged awaiting a cross-app child (a sub-orchestration on a separate per-session task hub Dapr's recursive terminate can't reach) is force-finalized by confirmDurableStop after a grace period once its child session is DB-terminated; this is exactly why the reaper/confirmDurableStop exist (the cross-app dispatch call_child_workflow was KEPT; fire-and-poll dispatch was tried and reverted). Two GitOps safety nets back it: the lifecycle-terminal-reaper CronJob (POST /api/internal/lifecycle/reap-terminal, reconciles stuck DB rows vs terminal/gone Dapr instances; it reconciles even during benchmark activity post-#69, and only its aged-stuck pass defers to an execution owned by a still-active coordinator run) and the workflow-builder-sandbox-gc CronJob (age-based GC of orphaned per-session agent-host Sandbox CRs in the workflow-builder namespace, excludes SandboxWarmPool-owned). Dapr stateRetentionPolicy is unified at 168h across the parent (workflow-orchestrator-no-tracing) and the per-session child Configs (workflow-builder-agent-runtime, openshell-sandbox-dapr), closing the old 168h-vs-30m split-brain that auto-purged children before the parent finished. 
A guarded, dry-run-by-default runbooks/phase0-lifecycle-clean-slate.{sh,md} (stacks workflow-builder component) is the one-time bulk-purge; it is NOT auto-run. If cleanup cannot prove closure, leave leases/sandboxes in place for a retry rather than creating invisible running workflows with missing workspaces.
Dapr sidecar liveness can stay green while readiness is permanently false. After placement/scheduler restarts or cert churn, workflow-builder runtime pods can show
1/2 because the app container is healthy but daprd returns ERR_HEALTH_NOT_READY: [grpc-api-server grpc-internal-server]. Logs often include Actor runtime shutting down, Placement client shutting down, or Workflow engine stopped. Verify dapr-system is currently healthy, then recycle the affected Deployment (openshell-agent-runtime, swebench-coordinator, or another Dapr-enabled runtime). See runbooks/debug-dapr-sidecar-stale-readiness.md.
Workflow JSON specs do not flow through image rebuilds.
services/<agent>/<name>.workflow.json is excluded from the production Dockerfile copy list. Editing it in the repo + rebuilding doesn't change runtime behavior; the spoke's workflows.spec JSONB column is read at execution time. Updating the spec requires a DB UPDATE on each spoke. See runbooks/upsert-workflow-json.md.
ArgoCD SSA validation blocks parent-syncs-child-Application apply. When the parent app (e.g.,
spoke-dev-workflow-builder) tries to apply a kustomize-patched child Application (e.g., dev-browserstation), you may see Application.argoproj.io "<child>" is invalid: status.sync.comparedTo.source.repoURL: Required value. The parent's SSA payload nullifies a status field the CRD validator requires. Workaround: patch the live child directly with kubectl patch app dev-<name> --type=json -p='[{"op":"replace","path":"/spec/source/kustomize/images/0","value":"..."}]'. The parent will keep retrying with the failing apply but the child's live spec is correct.
ArgoCD 3.4.x
ClientSideApplyMigration wedges large CRDs (argo-cd#26279). Before doing SSA, ArgoCD 3.4.x runs a ClientSideApplyMigration step when the live object is not yet argocd-controller-owned. That intermediate client-side apply writes a last-applied-configuration annotation; for very large objects (the ~1.4MB workloads.kueue.x-k8s.io CRD) the annotation exceeds the 262144-byte Kubernetes annotation size limit and the sync wedges. Triggered on ryzen because the CRD had been hand-applied with kubectl during recovery, so kubectl co-owns it (managedFields owners = kubectl, argocd-controller, kube-apiserver, kueue). Fix is a ryzen-only overlay patch adding ClientSideApplyMigration=false to the kueue Application's syncOptions (packages/overlays/ryzen/kustomization.yaml ~line 261): pure SSA, clean ownership transfer, no Workload CR data loss. Keep it while kubectl co-owns the CRD (harmless no-op on a clean recreate). hub/dev/staging never hit this (argocd-controller owned the CRD from the start).
A kustomize RFC6902
op: add /spec/source/kustomize REPLACES the whole node (last-writer-wins). When two co-located patch blocks both op: add to the same path (e.g. both packages/components/profiles/local-core-ryzen AND packages/overlays/ryzen patch the tailscale-operator Application's /spec/source/kustomize), the later block clobbers the earlier one entirely; you do not get a merge. The overlay runs after the component, so the overlay block wins; anything that must survive (e.g. the gitea-tailscale-backend Service $patch: delete that stops ryzen syncs failing with namespaces gitea not found) must be co-located inside the winning overlay block, not split into the component. This clobber rule governs every co-located op: add between local-core-ryzen and overlays/ryzen.
KubeRay head pod doesn't auto-roll on image change. When a RayCluster spec image is bumped via
kustomize.images, the KubeRay operator gradually rolls workers but the head stays on the old image until explicitly deleted (kubectl delete pod -n ray-system browserstation-head-<id>). Workers wait on head GCS via the wait-gcs-ready init container, so a stuck old head blocks worker rollout too. Verify with kubectl get pod -l ray.io/cluster=browserstation -o jsonpath='{range .items[*]}{.metadata.name} {.spec.containers[?(@.name=="ray-head")].image}{.spec.containers[?(@.name=="ray-worker")].image}{"\n"}{end}'.
Buildah short-name resolution is enforced in noninteractive Tekton builds.
FROM rayproject/ray:2.47.1-cpu fails with short-name resolution enforced but cannot prompt without a TTY. Always fully-qualify base images (docker.io/rayproject/...). Fix is in the Dockerfile, not the pipeline.
Unpinned
pnpm → v10 fails the prod build with ERR_PNPM_IGNORED_BUILDS (wfb PR #42). A Node service whose Dockerfile does npm install -g pnpm (unpinned) gets pnpm v10, which FAILS at RUN pnpm build with ERR_PNPM_IGNORED_BUILDS because esbuild/protobufjs build scripts are blocked behind an approval gate. FIX: pin pnpm@9 (like mcp-gateway); do NOT rely on --ignore-scripts (leaves esbuild's binary missing for the build stage). KEY INSIGHT: such a prod-build break can HIDE for weeks/indefinitely because the per-service trigger only fires on a services/<svc>/ change, so the image just stays frozen at the last successful build. (function-router was stuck at a May-21 image for exactly this reason; recover via the build-to-current PipelineRun technique once pnpm@9 is pinned.)
Active
skaffold dev sessions can mask declarative image rollout. The ryzen Application can be Synced/Healthy and the declarative Deployment image can point at the new tag while a Skaffold-owned dev pod serves live traffic from synced source (ArgoCD paused via skip-reconcile). Verify the actual serving pod, image, and synced files before declaring ryzen done.
Skaffold-owned dev pods cache stale env vars across ArgoCD updates. A subtle variant of the above: when ArgoCD bumps an env var on
Deployment-workflow-builder.yaml (e.g. AGENT_RUNTIME_DEFAULT_IMAGE), the Deployment+ReplicaSet roll, but the long-lived workflow-builder-dev-* pod was created hours/days earlier and won't restart on its own. The serving pod reads the OLD env value, so the BFF keeps spawning per-session Sandbox pods on the stale runtime image despite the manifest being "synced". Diagnose: kubectl get deploy workflow-builder -o jsonpath='{...env...}' shows the new value but kubectl exec deploy/workflow-builder -- printenv AGENT_RUNTIME_DEFAULT_IMAGE still shows the old one. Recovery: exit skaffold dev (which removes the dev override) OR kubectl delete pod workflow-builder-dev-* (forces a fresh pod from the current Deployment template). After either, verify the standard workflow-builder-* pod has the expected env and launch a fresh session to confirm the new runtime image.
The runtime image's single source of truth is the BFF env var, read per-session. There is no
AgentRuntime CR to patch and no revert loop; the CRD + Kopf controller are retired. To roll the runtime forward durably: (1) bump the AGENT_RUNTIME_*_DEFAULT_IMAGE env var in the stacks Deployment YAML AND (2) verify the BFF pod (not just the manifest) sees the new value (kubectl exec deploy/workflow-builder -- printenv ...), then (3) launch a fresh session; the next agent-sandbox Sandbox pod pulls the new image. (For the static dapr-agent-py pool, roll its Deployment instead.) See runbooks/bump-image-pin-not-in-release-pins.md.
rayproject/ray:2.47.1-cpu ships Python 3.9. PEP-604 union syntax (def f(x: float | None)) fails at module import with TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'. Add from __future__ import annotations at the top of the file or use Optional[X]. Caused a head-pod CrashLoopBackOff on the Tier 2 browserstation rollout.
A release-pinned tag missing on GHCR means the outer-loop build didn't run. The local Gitea image registry on ryzen is retired (May 2026) and all clusters pull from
ghcr.io/pittampalliorg/*, so the old gitea→ghcr mirror runbook was removed (its source registry no longer exists). When a git-<sha> tag is absent from GHCR, rebuild it from the source commit via the GitHub outer-loop (or a manual build-and-push to GHCR using hub ghcr-push-credentials); see runbooks/debug-funnel-orphan-tag.md for the webhook/EventListener failure that suppresses the build. The github-outer-loop EL has per-service triggers for ALL Skaffold-owned services (workflow-builder, workflow-orchestrator, function-router, mcp-gateway, swebench-coordinator; verified end-to-end 2026-06-05), each firing on a services/<svc>/ change. But because a trigger ONLY fires on a source change to its own service, a service that has had no services/<svc>/ push stays frozen at its last successful build; use the build-to-current PipelineRun technique (see the Outer-loop lane bullet) to re-pin it from current main HEAD without a source change.
Tailscale Funnel orphan tags silently break webhooks. If a tag is removed from
policy.hujson but a device still uses it, the operator pod claims "Funnel on" locally but the control plane revokes the cap. Public DNS goes NXDOMAIN. Diagnostic: tailscale status --json | jq '.Self | {Tags, CapMap}' from inside the proxy pod.
ProxyGroup service-host tags are separate from Funnel tags. For hub browser VIPs, the Ingress
tailscale.com/tags, Tailscale Service tags, policy.hujson autoApprovers.services, and the authenticated ProxyGroup pod tag must agree. Hub cluster-ingress should authenticate as tag:k8s-services; tag:k8s is legacy compatibility only.
Device-backed Tailscale Ingresses are not
svc:* service-hosts. Promoted-spoke app URLs without tailscale.com/proxy-group (e.g. phoenix-*) register as Tailscale devices, usually tagged tag:k8s. Do not add autoApprovers.services["svc:<hostname>"] for these. A stale svc:<hostname> record can reserve the canonical DNS name and force the real device to register as <hostname>-1.
workflow-builder web exposure = Tailscale L4 LoadBalancer + in-cluster TLS, NOT an Ingress, NO Let's Encrypt (PR #2319).
workflow-builder is reached at https://workflow-builder-{dev,ryzen,staging}.tail286401.ts.net via a Tailscale L4 LoadBalancer Service (type: LoadBalancer, loadBalancerClass: tailscale, tailscale.com/hostname, 443→https-tls) whose HTTPS is terminated by a per-pod nginx tls-terminator sidecar serving a persistent self-signed wildcard *.tail286401.ts.net. The old ingressClassName: tailscale + LE development-prod-cert exposure (and ryzen's brief plain-HTTP LB, #2314/#2316) is retired: recreate churn exhausted LE's 5-certs/168h limit (429 → unreachable). The wildcard is signed by the tailnet-dev-ca ClusterIssuer, restored from KV (TAILNET-DEV-CA-CRT/-KEY) via hub ns spoke-secrets Secret tailnet-ca → spoke base app packages/components/tailnet-ca. The LB Service registers as a tag:k8s device (same ACL rule as device-backed Ingresses); it is NOT a svc:* service. mcp-gateway was dropped from the tailnet, so MCP_GATEWAY_BASE_URL=http://mcp-gateway.workflow-builder.svc.cluster.local:8080; ORIGIN/APP_PUBLIC_URL stay https://workflow-builder-<cluster>.... Manifests: Service-workflow-builder-tailnet.yaml (dev/staging) / workflow-builder-tailnet-lb/ (ryzen) + the sidecar ConfigMap/Certificate under workflow-builder/manifests/. See reference/access-paths.md and reference/architecture.md.
workflow-builder-<cluster> 502 for browsers but 302 for curl (PR #2327). The tls-terminator nginx default 8k proxy header buffer overflows on SvelteKit auth's large Set-Cookie headers, so browsers get 502 while bare curl (small headers) returns 302, masking it. Fix lives in the sidecar ConfigMap-workflow-builder-tls-terminator.yaml (proxy_buffer_size 32k; proxy_buffers 8 32k; proxy_busy_buffers_size 64k; large_client_header_buffers 4 32k). Verify HTTPS exposure with a REAL browser (or curl with full browser headers), and diagnose via the sidecar nginx error log. 
Opening the URLs cleanly also needs workstation trust of the "PittampalliOrg Tailnet Dev CA" (nixos-config 44ba6324; the Chrome NSS seed is required on NixOS because security.pki doesn't cover Chrome).
gitea-registry-creds imagePullSecret is RETIRED fleet-wide (PR #2317); do not re-add it. It was a dead reference (the secret was never produced on any cluster); PR #2317 removed it from 23 manifests + 2 SAs. All images pull from ghcr.io/pittampalliorg/* via ghcr-pull-credentials. (deployment/scripts/trigger-tekton-builds.sh keeps a same-named build-side PUSH credential, which is different and intentionally kept.)
Stale tailnet devices: gated pre-recreate script + the hub sweeper backstop (PRs #2322/#2325). The hard on-recreate guarantee against
<hostname>-N collisions is the gated deployment/scripts/cleanup-tailnet-devices.sh run pre-recreate. As a hygiene backstop, hub CronJob tailnet-device-sweeper (ns tailscale, every 15m) deletes OFFLINE stale spoke devices (lastSeen > 30m, best-effort). API gotcha: the device hostname field DROPS the -N suffix (a live device and its dead -N twin share one hostname), so match on the MagicDNS name; lastSeen IS a reliable liveness signal. An in-Composition pre-onboarding cleanup was deliberately NOT built (a function-pipeline error would halt ALL spoke provisioning).
env/hub-next can go MISSING after a hub promotion PR merges. GitOps Promoter logs ChangeTransferPolicyNotReady "couldn't find remote ref env/hub-next" and PromotionStrategy stacks-environments goes NotReady (flooding warning events). It is NOT GitHub auto-delete (delete_branch_on_merge=false); only env/hub-next is affected (spoke -next branches self-heal via their busy hydrators; the idle hub hydrator doesn't recreate it). When active == proposed dry SHA (no pending hub change), recreate it: git push origin origin/env/hub:refs/heads/env/hub-next; the Promoter reconciles to Ready. See runbooks/manage-gitops-promoter.md.
Tailscale operator Secret metadata matters after manual recovery. Ingress proxy state Secrets for device-backed Ingresses (e.g.
tailscale/ts-phoenix-tailscale-*-0) must keep labels tailscale.com/managed=true, tailscale.com/parent-resource=<ingress>, tailscale.com/parent-resource-ns=<namespace>, and tailscale.com/parent-resource-type=ingress. If a manual auth/key repair leaves a huge kubectl.kubernetes.io/last-applied-configuration annotation or strips labels, the endpoint may work while ArgoCD stays Progressing; restore labels and remove the stale annotation.
Tailscale egress targets nodes, not service-host VIPs.
gitops-inventory-hub.tail286401.ts.net is a service-host VIP backed by cluster-ingress; egressing to it produces "node not found", ECONNREFUSED, or timeouts. Spoke inventory fetches must use gitops-inventory-hub-node.tail286401.ts.net:8080 through gitops-inventory-hub-egress.tailscale.svc.cluster.local:8080.
Tailscale operator mutates egress Services. It writes
/spec/externalName and may add /spec/ports/0/targetPort; Argo Applications that own egress Services should ignore both fields or they will stay OutOfSync despite working traffic.
Dev/staging service URLs must be declared, not inferred from ryzen. Phoenix URLs belong in the spoke ApplicationSet template with
{{cluster}}, with matching phoenix-* Tailscale Ingresses. MCP_GATEWAY_BASE_URL is now the in-cluster http://mcp-gateway.workflow-builder.svc.cluster.local:8080 (mcp-gateway dropped from the tailnet, PR #2319; no mcp-gateway-* Ingress). Add policy.hujson svc:* approvals only when the hostname is actually served by a ProxyGroup/Tailscale Service.
Spoke API VIPs need both service approval and Kubernetes grants.
dev-api-v2/staging-api-v2 service-hosts need autoApprovers.services entries, and the authenticated tag:spoke-api devices need a Kubernetes impersonation grant to tag:k8s with system:masters. Re-authenticate the spoke ProxyGroup after ACL changes if the device still has stale caps.
ESO refresh ↔ pod restart race. When rotating a KeyVault secret, ESO may not finish writing the K8s Secret before a Deployment restart kicks off. The new pod reads the stale value. Always verify the K8s Secret head matches the new value before triggering the restart.
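The ESO race check described above can be scripted so the restart only happens once the Secret has caught up. This is a minimal sketch, not the repo's tooling: it assumes the Secret is fetched as JSON (for example with kubectl get secret <name> -o json), and the function names and the client-secret key are hypothetical.

```python
import base64


def decode_secret_key(secret_manifest: dict, key: str) -> str:
    """Decode one key of a Secret as emitted by `kubectl get secret -o json`."""
    return base64.b64decode(secret_manifest["data"][key]).decode("utf-8")


def secret_head_matches(secret_manifest: dict, key: str, expected_head: str) -> bool:
    """Return True when the live Secret value starts with the rotated value's head.

    Comparing only a short head avoids pasting the full secret into a shell
    history while still distinguishing the old value from the new one.
    """
    if not expected_head:
        raise ValueError("expected_head must be non-empty")
    return decode_secret_key(secret_manifest, key).startswith(expected_head)


if __name__ == "__main__":
    # Hypothetical payload standing in for the kubectl JSON output.
    manifest = {"data": {"client-secret": base64.b64encode(b"new-value-123").decode()}}
    print(secret_head_matches(manifest, "client-secret", "new-va"))
```

Only trigger the Deployment restart once the check returns True; if it stays False, ESO has not yet written the rotated value.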
GitHub repo picker missing org repos = the per-cluster OAuth app lacks the org grant — dev and ryzen are SEPARATE GitHub OAuth apps.
Workflow Connections (Dev) (Ov23linctlmmlA9F8odt) and Workflow Connections (Ryzen) (Ov23liqsg0KjlK52R2at) each need PittampalliOrg access granted in the GitHub org's OAuth-app settings; the grant applies LIVE to existing tokens (no re-auth, no secret rotation). The client_id a cluster uses comes from the platform_oauth_apps DB table (seeded by sync-oauth-apps), not an env var, so check there before assuming a config drift. Separately, /api/scm/repos paginates since wfb PR #89 (5 pages / 500 repos); before that a single-page fetch silently truncated the picker at 100 repos, which mimics a missing-grant symptom.
Hub→spoke kube-API reach is RETIRED as the ArgoCD sync path (argocd-agent v0.8.1). Each spoke reconciles its own apps locally; the agent dials the hub principal OUTBOUND (8443). The
cluster-<spoke> Secret is an AGENT MAPPING (server=https://argocd-agent-resource-proxy:9090?agentName=<spoke> + embedded mTLS, NO bearerToken), not a kube-API endpoint. The legacy hub→ryzen apiserver-proxy SNI path (operator v1.92.4+ strictly validates the wire SNI; cluster-ryzen server: https://ryzen-operator... + hub CoreDNS rewrite to ryzen-api-egress; the stale ryzen-operator-1 device + curl --connect-to verify) and the ryzen host raw-TCP tailscale serve --tcp=6443 passthrough now exist ONLY for Headlamp (the host-passthrough endpoint lives in the dedicated headlamp.dev/cluster=true Secret, separate from the agent mapping), NOT for ArgoCD sync. The ryzen→hub ESO secret-fetch transport (a different device + RYZEN-side CoreDNS rewrite) is also unaffected. See reference/architecture.md. Don't use MagicDNS directly from hub pods; it fails or hangs.
Spokes are registered by an Argo CD cluster Secret + two appsets. The cluster Secret (
argocd.argoproj.io/secret-type: cluster + stacks.io/{hub-managed,cluster-role,platform} labels + spoke-cluster/stacks.io/source-branch annotations) in hub argocd ns is the contract. spoke-clusters-appset.yaml (clusters generator) templates the root spoke-<name> Application (path packages/overlays/<name>, targetRevision from the source-branch annotation); spoke-workloads-appset.yaml adds the workload.stacks.io/workflow-builder=true selector and templates spoke-<name>-workflow-builder. ryzen OMITS the workflow-builder label on purpose (its overlay composes workflow-builder-system directly), so only spoke-clusters generates for ryzen. dev/staging cluster Secrets are minted by Crossplane onboarding; cluster-ryzen is a STATIC GitOps-delivered Secret. GOTCHA: there is a SECOND, unused spoke-clusters-appset.yaml under packages/components/hub-base/apps/ that hardcodes targetRevision: main + has the empty-kustomize: {} hydrator-stall trap; the hub uses only the hub-spoke-appsets copy, so edit that copy. See reference/architecture.md.
Hub build nodes are the default build capacity. The current hub baseline is three
cpx41 control/management nodes plus two tainted ccx33 build workers (stacks.io/build-pool=hub, upgradeable to ccx43). Do not remove the build-node taint to "fix" scheduling; add the node selector/toleration to the PipelineRun template.
ProxyGroup auth must target the intended context.
kubectl --kubeconfig ~/.kube/config does not select a cluster by itself; it still uses that file's current context. For dev/staging/hub repairs, minify the intended context into a temporary kubeconfig or set KUBECONFIG to the Crossplane fallback kubeconfig before running deployment/scripts/tailscale/proxygroup-auth.sh. For kube-apiserver ProxyGroups, the script patches the *-config secret, not just TS_AUTHKEY env.
Ryzen reconciles
main DIRECTLY; pushing to main IS how content reaches ryzen. The inner-loop branch is RETIRED (deleted). Ryzen's LOCAL ArgoCD root-ryzen tracks packages/overlays/ryzen@main, so commit-pin.sh (and any manifest edit) just commits to main; ryzen reconciles on its next poll, or force an immediate re-compare with deployment/scripts/ryzen-sync.sh (hard-refreshes root-ryzen). There is NO env/spokes-ryzen, NO source-hydrator, NO Promoter on the ryzen lane, so the empty-drySource.kustomize hydrator-stall bug does NOT apply. If ryzen looks frozen, hard-refresh root-ryzen; don't look for an inner-loop advance.
Manual branch reconciliation is not part of the normal ryzen loop. Ryzen-related Applications source GitHub
main directly via the local root-ryzen; there is no separate ryzen branch to keep in sync.
argocd-hub.tail286401.ts.net works even when other Tailscale ProxyGroups are down. It's an independent ProxyGroup. When per-spoke Tailscale access is broken, you can still drive ArgoCD ops from the hub via argocd login argocd-hub.tail286401.ts.net --grpc-web.
GitOps Promoter app releases may be newer than the Helm chart appVersion. Verify both upstream release and Helm chart metadata. As of 2026-04-24, the controller runs
v0.27.1; the latest Helm chart is 0.6.0 with appVersion: 0.26.2, so stacks keeps chart 0.6.0 and overrides manager.image.tag.
Promoter UI patch hooks need a shell-capable kubectl image.
registry.k8s.io/kubectl is distroless and has no /bin/sh; use alpine/k8s:<version> for shell-scripted hook jobs. The ArgoCD Helm chart's server container is named server, even though the Deployment is argocd-server.
Hub source-hydrator status can pin a stale dry SHA. If
root-application.status.sourceHydrator.currentOperation.drySHA stays behind origin/main, remove currentOperation and lastSuccessfulOperation from status and hard-refresh the app. See runbooks/manage-gitops-promoter.md.
Drizzle Kit silently skips SQL files lacking
_journal.json entries. The db-migrate Sync hook on dev/staging runs npx drizzle-kit migrate, which globs drizzle/*.sql BUT only applies files with a matching entries[] tag in drizzle/meta/_journal.json. Job exits 0 either way, which makes it easy to miss. Always update the journal when adding a migration; older files in the repo (0006/0007/0020/0032/0037-0043) lack journal entries because their columns were applied via out-of-band paths historically. Prompt Workbench's resource_prompt_versions table is one of the checks to run after prompt-preset deploys. See runbooks/fix-drizzle-migration.md.
Two migration runners read from two different directories.
src/lib/server/startup.ts reads from atlas/migrations/ (timestamp-prefixed); npx drizzle-kit migrate reads from drizzle/ (incremental + journal-gated). The production image's Dockerfile copies drizzle/ but .dockerignore excludes atlas/, so the atlas-runner is effectively only active in the ryzen Skaffold dev pod (which file-syncs source). New migrations usually need to live in BOTH dirs, both idempotent (ADD COLUMN IF NOT EXISTS).
Source-hydrator polls every ~3 min. After release metadata lands on
origin/main, expect 5-8 min before dev's pod is rolling on the new image, then staging waits for its configured soak timer after health. argocd app refresh --hard triggers manifest re-render but does NOT immediately repoll branch tips. argocd app sync --revision <sha> is rejected on auto-sync + branch-tracking apps (Cannot sync to <sha>: auto-sync currently set to <branch>). Don't hard-sync; wait. See runbooks/track-promotion-state.md for what's-actually-stuck triage.
Generated
env/spokes-* branches need guardrails. If the generated app directory drifts from env/spokes-*-next, use scripts/gitops/reconcile-spoke-generated-dir.sh <dev|staging> check|fix; do not hand-edit generated env branches unless the script proves the root and child dry SHAs match.
git status --porcelain R prefix means "renamed in INDEX, already staged". Filtering with grep -E "^A |^M " MISSES it. After a stale git add or interrupted commit, your next git commit will scoop in any pre-staged renames/deletes alongside what you intended. Before committing, either git reset HEAD -- to clear the index then re-stage exact paths, or use git diff --cached --name-status (which shows ALL staged changes including renames + deletes + mode changes).
SWE-env build cache lock needs THREE layers, not one.
Task-swebench-inference-image-build-push.yaml acquires /var/lib/containers/.swebench-buildah.lock via mkdir. The shell trap releases on graceful exit but SIGKILL is uncatchable (OOMKill, eviction, controller force-delete bypass it). Tekton's retries:1 then spawns a retry pod that inherits the parent TaskRun name, so the dead pod's owner file says taskRun=<the same name> and the retry sees its own predecessor's lock as held by "another PR" → spin-polls forever → PR never terminates → Pipeline finally: never runs → deadlock. Fix is committed (stacks 52bb0b18 + f6f4bb00 + d450c9b1); know the symptom: retry-pod logs say Buildah cache lock is held by: taskRun=<own name> and original pod is OOMKilled. The 3 layers: (1) self-takeover in acquire_buildah_cache_lock: if owner taskRun matches context.taskRun.name, remove + reacquire; (2) Pipeline finally: task release-buildah-cache-lock: removes lock if owner starts with $(context.pipelineRun.name)-; (3) BUILDAH_CACHE_LOCK_STALE_SECONDS=1800 (was 21600; 6h was way too long). Also: build-and-push memory bumped 2Gi req → 4Gi req / 6Gi limit so OOMKills become rare AND when they happen the kernel kills the offending container only (predictable). If a stale lock recurs, manual clear: spin a busybox pod with the buildah-cache-swebench-inference PVC mounted and rm -rf /cache/.swebench-buildah.lock.
K8s label values must match
(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?— alphanumeric start AND end. Nanoid-generated IDs (workflow-builder benchmark runs, etc.) can legally end in_or-. When a service usesrun["id"]directly as a label value (e.g. swebench-coordinator's evaluator Job), the API rejects creation with HTTP 422Invalid value: "<id>": a valid label must ... start and end with an alphanumeric character. Symptoms: run goesfailedimmediately at the evaluating stage with no useful inference error; harness eval never executes. Fix at the use site, not the ID generator (don't break existing data): trim outer[._-]characters and cap at 63 chars. Coordinator fix landed at workflow-builder0f369b58via_safe_label_valuehelper at lines 471/485 ofservices/swebench-coordinator/src/app.py. If you add a new service that labels K8s resources with run/instance IDs, mirror the helper.SWE-bench random readiness is exact on the current environment identity. The Benchmarks page and
POST /api/benchmarks/runsmust use the same readiness logic as coordinator preflight. Static ConfigMap pins are ready when suite/repo/baseCommit/version and image digest match, even ifenvironmentSetupCommitis absent. Dynamic DB build rows are ready only whenenvironment_image_builds.env_spec_hashequals the currentbuildSwebenchEnvironmentSpec()hash. Do not fall back to repo/version/baseCommit-only DB matching; that admits old images and leaves runs parked in preflight while hub builds a new image.
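The label-value rule above (trim outer `[._-]`, cap at 63 chars) can be sketched as follows. This is a minimal illustration only: `safe_label_value` is a hypothetical stand-in, not the actual `_safe_label_value` helper in `services/swebench-coordinator/src/app.py`, and the cap-then-trim ordering and the `ValueError` on unrecoverable input are assumptions.

```python
import re

# Label-value pattern quoted in the gotcha above; fullmatch() anchors both ends.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")
MAX_LABEL_LEN = 63


def safe_label_value(raw: str) -> str:
    """Hypothetical helper: make an ID usable as a Kubernetes label value."""
    # Cap first, then trim: truncation can expose a new trailing '_', '-' or '.'.
    value = raw[:MAX_LABEL_LEN].strip("._-")
    if not LABEL_VALUE_RE.fullmatch(value):
        # Inner characters outside [-A-Za-z0-9_.] cannot be fixed by trimming.
        raise ValueError(f"cannot derive a valid label value from {raw!r}")
    return value


if __name__ == "__main__":
    print(safe_label_value("W4ZmHxaEMEYQDCZ_Ypo4_"))
```

Sanitizing at the use site keeps the stored run ID unchanged, so existing rows and URLs stay valid while only the label copy is trimmed.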
Update (2026-05-19) — image delivery while the 2nd EL is still dead, + dev-portability of utility images
- Ryzen sync is locally-driven. Ryzen's OWN argocd-application-controller reconciles `root-ryzen@main` on its poll interval OR responds to `refresh=hard` annotations on the local `ryzen-*` Applications (`--context admin@ryzen`). Ryzen has its own local ArgoCD (autonomous agent); no local Gitea. Normal health target is every `ryzen-*` Application `Synced/Healthy` on ryzen's local ArgoCD.
- Working ryzen image pattern: outer-loop GitHub lane builds + pushes `ghcr.io/pittampalliorg/<img>:git-<sha>`. To deliver to ryzen specifically, edit `packages/components/workloads/<comp>/manifests/kustomization.yaml` `images:` to `newName: ghcr.io/pittampalliorg/<img>` + `newTag: git-<sha>` and commit/merge to `main` (or use `commit-pin.sh` automatically, which pushes to `main`). Ryzen's local ArgoCD re-renders `overlays/ryzen@main` and rolls the pod. Ryzen Deployments already have the `ghcr-pull-credentials` imagePullSecret materialized by ESO from KV `GITHUB-PAT`. Exception since C1 (2026-06-04): workflow-builder + workflow-mcp-server have NO bare `images:` block: `commit-pin.sh workflow-builder` writes the flat `release-pins/workflow-builder-images-ryzen.yaml` AND renders + commits the `workflow-builder-ryzen-image` Component LOCALLY in the same push (wfb PR #37; stacks CI is just a drift-net); do not hand-edit a `newTag` for those two (see the "two visible truths" gotcha).
- Preserve workloads pins in git. Ryzen renders whatever workloads pin is committed to `main`. Image pins committed to `main` are the same pins dev/staging consume via their release-pins / Promoter outer-loop path. `dev-*` apps source `workloads/*/manifests/` @ origin/main HEAD (shared with ryzen, NOT a per-spoke render). `dev-swebench-coordinator`, `dev-swebench-evaluator-tekton`, `dev-workflow-builder` ArgoCD apps point at `github.com/PittampalliOrg/stacks.git` path `packages/components/workloads/<comp>/manifests/` rev `HEAD`. So a commit to `origin/main` in workloads delivers to dev automatically; the spoke-workloads ApplicationSet additionally rewrites the release-pins workload images onto those apps' `spec.source.kustomize.images` (swebench-coordinator/evaluator/workflow-builder are release-pins keys; outer-loop `update-stacks` already bumps them). The base workloads pins are the ryzen value; dev's image comes from the release-pins override.
- Utility/init images pinned to ryzen Gitea break dev/staging (`Init:ImagePullBackOff`). `workloads/{swebench-coordinator,evaluation-coordinator}/manifests/kustomization.yaml` rewrote `bitnami/kubectl` (+ `alpine/k8s`) → `gitea.cnoe.localtest.me:8443/giteaadmin/kubectl` (ryzen's in-cluster Gitea, unreachable from dev/staging). The spoke ApplicationSet only rewrites release-pins workload images, NOT utility/init images, so those Deployments' `wait-for-workflowstatestore` init containers were permanently `Init:ImagePullBackOff` on dev/staging: they had never run there; only ryzen worked (local mirror). Fix (stacks #1707): mirror ryzen's kubectl image → `ghcr.io/pittampalliorg/kubectl:latest` (`skopeo copy --src-tls-verify=false docker://gitea-ryzen.tail286401.ts.net/giteaadmin/kubectl:latest docker://ghcr.io/pittampalliorg/kubectl:latest`, dest-auth = hub `ghcr-push-credentials`; as an agent use the SSH-wrapped `ssh vpittamp@ryzen '…'` form to dodge the bash-tool Production-Reads guard) and rewrite both kustomizations gitea → `ghcr.io/pittampalliorg/kubectl` (all-spoke; ryzen pulls GHCR fine). General rule: any workloads manifest that rewrites a utility image to `gitea.cnoe.localtest.me:8443/...` is dev/staging-broken by construction.
- ArgoCD won't advance to a new commit despite autoSync/selfHeal? Symptom: app `OutOfSync` at the new rev, `operationState.phase=Running … retrying attempt #N`, a Deployment `ProgressDeadlineExceeded` behind a stuck old pod (e.g. the prior pod was `Init:ImagePullBackOff`). The stuck in-flight sync op blocks the new revision. Recovery: `argocd app terminate-op <app> --grpc-web` (login `argocd-hub.tail286401.ts.net` with `argocd-initial-admin-secret`). Terminate alone is usually sufficient: once the stuck op is killed, `autoSync/selfHeal` applies the current desired revision within ~1 min. `argocd app sync --force` typically keeps returning `another operation is already in progress` while the terminate winds down; don't fight it, wait for selfHeal. (`runbooks/recover-stuck-promotion.md` has the full procedure.)
- Benchmark-run Dapr-lifecycle recovery lever: `POST http://<bff>/api/internal/benchmarks/runs/<runId>/cleanup` header `x-internal-token: $INTERNAL_API_TOKEN` body `{}` runs the documented terminal-cleanup teardown (its cascade is now generalized into, and shared with, the vetted Lifecycle Controller, `src/lib/server/lifecycle/`). DB-cancel alone does NOT terminate the durable Dapr session workflows (they keep re-spawning openshell sandboxes); the session-termination path only fires when the run is cancelled, so set runs+instances `status='cancelled'` first, then call cleanup (expect retries; coordinator+DB must be up). As a passive backstop, the lifecycle-terminal-reaper CronJob reconciles stuck rows on a timer, but it deliberately skips while a benchmark run/lease is active, so for a live run you still drive cleanup explicitly. See the `evaluations` skill's "System State Update (2026-05-19)".
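The cleanup call described above can be sketched in Python as below. This is an illustrative sketch under assumptions, not project code: the function names, the `Content-Type` header, the timeout, and the retry count/delay are hypothetical; only the path, the `x-internal-token` header, and the `{}` body come from the text. It assumes the runs+instances rows were already set to `status='cancelled'`.

```python
import json
import time
import urllib.error
import urllib.request


def build_cleanup_request(bff_base_url: str, run_id: str, token: str) -> urllib.request.Request:
    """Build the POST request; nothing is sent until call_cleanup() runs."""
    url = f"{bff_base_url.rstrip('/')}/api/internal/benchmarks/runs/{run_id}/cleanup"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # body is the empty JSON object {}
        headers={
            "x-internal-token": token,
            "Content-Type": "application/json",  # assumption: not stated in the text
        },
        method="POST",
    )


def call_cleanup(request: urllib.request.Request, attempts: int = 5, delay_seconds: float = 10.0) -> int:
    """Send the request, retrying on errors since the text says to expect retries."""
    last_error = None
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.status
        except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise RuntimeError(f"cleanup failed after {attempts} attempts") from last_error
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before touching a live run.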
Dev SWE-bench concurrency envelope
Practical limits for /workspaces/<slug>/benchmarks are layered. The current intended dev GitOps values are:
| Layer | Knobs | Intended dev value |
|---|---|---|
| Launch/BFF default | `BENCHMARK_DEFAULT_CONCURRENCY` | `10` |
| Capacity mode | `BENCHMARK_CAPACITY_MODE` | `auto` |
| Execution backend/class | `BENCHMARK_EXECUTION_BACKEND` / `BENCHMARK_EXECUTION_CLASS` | `dapr-kueue` / `benchmark-fast` |
| Full-instance Kueue model | `BENCHMARK_KUEUE_INSTANCE_REQUEST_MODE` | `host-worker-composite` |
| Lease resources | `BENCHMARK_KUEUE_LEASE_RESOURCES` | `openshell_sandbox,dapr_workflow_slot` |
| Shared coding pool | `AGENT_RUNTIME_POOL_APP_IDS_JSON` | `agent-runtime-pool-coding`, `maxReplicas=16`, `slotsPerReplica=12` on dev |
| Dedicated coding fallback | `AGENT_RUNTIME_SLOTS_PER_REPLICA_JSON` | `coding=12` |
| Per-sidecar Dapr workflow cap | `AGENT_RUNTIME_DAPR_WORKFLOW_LIMIT_PER_SIDECAR` | `12` |
| Coordinator start pacing | `SWEBENCH_COORDINATOR_INSTANCE_START_BATCH_SIZE` / `...DELAY_SECONDS` | unset or `0` / `0` for full effective-concurrency fan-out |
| Evaluator parallelism | `SWEBENCH_EVAL_MAX_PARALLEL` | `24` |
Per-instance peak draw during inference is modeled as the full sandbox/worker +
agent-host bundle. The capacity snapshot fields kueueInstanceRequest*,
kueueInstancePodCount, kueueAvailableInstanceSlots, and
schedulableKueueInstanceCapacity are the deployment-time truth; do not infer
safe concurrency from the launch slider or sandbox-only capacity. Per-run
harness evaluation adds Kueue-admitted evaluator TaskRuns. Before raising a run,
verify live node headroom, Kueue quota, Dapr runtime readiness, model/provider
rate limits, and exact-ready image coverage.
Current clean dev checkpoint: run W4ZmHxaEMEYQDCZ_Ypo41 completed 25 distinct
exact-ready SWE-bench_Verified instances with DeepSeek V4 Pro at maxTurns=25.
It requested/effectively ran inference 25/25; evaluator requested/effective was
24/9 because Kueue clamped eval capacity. Result was 13 resolved / 7 unresolved
/ 5 empty-patch, zero evaluator errors, zero hard errors, zero active leases
after cleanup, and no Dapr activity-registration failures. Treat 25 as proven;
do not jump above it without a clean launch gate and exact-ready preview.
Ryzen runs the same composite capacity model but has much less request
headroom. The 2026-05-27 ryzen canary MPIlRkKWC7UdvHgwFQEiR selected 3
exact-ready instances and was correctly capped to effective concurrency 2 by
kueue_capacity; all three instances inferred/evaluated and active leases
returned to zero. Keep ryzen benchmark campaigns sequential even when a single
run can safely use multiple effective slots.
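The requested-versus-effective numbers above follow a simple clamp, sketched here. This is a simplified model, not the coordinator's actual capacity logic: the function name and signature are hypothetical, and the capacity values 2 and 9 used in the examples are inferred from the reported effective figures rather than read from a capacity snapshot.

```python
from typing import Optional, Tuple


def effective_concurrency(requested: int, schedulable_capacity: int) -> Tuple[int, Optional[str]]:
    """Clamp a requested concurrency to the schedulable Kueue instance capacity.

    Returns (effective, reason); reason is "kueue_capacity" when the request
    was reduced and None when it was admitted in full.
    """
    if requested < 1:
        raise ValueError("requested concurrency must be at least 1")
    effective = max(0, min(requested, schedulable_capacity))
    reason = "kueue_capacity" if effective < requested else None
    return effective, reason


if __name__ == "__main__":
    # Ryzen canary: 3 instances selected, capped to 2 (capacity of 2 inferred).
    print(effective_concurrency(3, 2))
    # Dev evaluator: 24 requested, 9 effective (capacity of 9 inferred).
    print(effective_concurrency(24, 9))
```

In this model the launch slider only sets `requested`; the snapshot field `schedulableKueueInstanceCapacity` would play the role of `schedulable_capacity`.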
What to read next
| If the task is… | Read |
|---|---|
| New to the system / orienting | reference/architecture.md |
| Need kubectl / argocd on a cluster | reference/access-paths.md |
| Deciding whether an app belongs on hub or a spoke | reference/app-placement.md |
| Anything secret-related (rotation, audit, debugging) | reference/secret-flow.md |
| Bumping an image to dev/staging | runbooks/promote-image-to-spokes.md |
| Post-push workflow-builder rollout verification on ryzen/dev | runbooks/track-promotion-state.md |
| Prompt Workbench or prompt preset DB/API changes after rollout | runbooks/track-promotion-state.md + workflow-builder references/prompt-workbench.md |
| SWE-bench evaluator promotion or Benchmarks page canary validation | shared-skills/evaluations/SKILL.md + runbooks/track-promotion-state.md |
| Agent is silent after adding MCP/OAuth connection; ActivePieces piece MCP catalog, Knative KServices, per-session MCP bootstrap, or Dapr statestore scope is suspect | runbooks/debug-workflow-builder-mcp-auth.md |
| workflow-builder pod shows 1/2, daprd readiness is false, or openshell-agent-runtime / swebench-coordinator is unavailable | runbooks/debug-dapr-sidecar-stale-readiness.md |
| workflow-builder works on ryzen but not dev/staging | runbooks/reconcile-workflow-builder-spoke-environment.md |
| Moving a ryzen-validated image to dev/staging | runbooks/promote-image-to-spokes.md |
| Image missing on ghcr.io (outer-loop build didn't run) | runbooks/debug-funnel-orphan-tag.md |
| Validate Workflow Builder Release Pins CI failing with denied on every image | runbooks/grant-stacks-ghcr-package-access.md |
| Bumping runtime images such as browser-use-agent-sandbox, dapr-agent-py-sandbox, or claude-agent-py-sandbox (per-session env-var images outside release-pins) | runbooks/bump-image-pin-not-in-release-pins.md |
| Editing a workflow JSON spec (maxTurns, prompt, agentKwargs, …) and rolling the change to dev/staging | runbooks/upsert-workflow-json.md |
| Upgrade GitOps Promoter or repair its ArgoCD UI extension | runbooks/manage-gitops-promoter.md |
| Review all OutOfSync/Degraded apps and decide keep vs remove | runbooks/review-argocd-app-health.md |
| ArgoCD operationState stuck Running | runbooks/recover-stuck-promotion.md |
| db-migrate Job stuck Terminating | runbooks/recover-stuck-job-finalizer.md |
| Webhook not firing / hub Tekton path broken (NXDOMAIN or 202-no-PipelineRun) | runbooks/debug-funnel-orphan-tag.md |
| Device-backed Tailscale Ingress missing address, using -1, or blocked by stale service/device records | runbooks/debug-device-backed-tailscale-ingress.md |
| ProxyGroup service-host missing address or cert domain | runbooks/debug-proxygroup-service-host.md |
| Migration shipped but columns missing on dev | runbooks/fix-drizzle-migration.md |
| Track a promotion in flight / what's gating it | runbooks/track-promotion-state.md |
| Spoke kubectl when Tailscale down | runbooks/access-spoke-cluster-fallback.md |
| Rotate a per-spoke OAuth client secret | runbooks/rotate-oauth-secret.md |
The runbooks each follow the same shape: Symptoms → Diagnostic → Fix steps → Verify.
CLIs the agent should assume are available
| Tool | Typical use here |
|---|---|
| `kubectl` | Multi-context via ~/.kube/config; for hub use --kubeconfig ~/.kube/hub-config (no SSH wrapper when on ryzen) |
| `argocd` | Login via Tailscale: argocd login argocd-hub.tail286401.ts.net --grpc-web (admin password in argocd-initial-admin-secret). Use for terminate-op, app sync --force, things kubectl-patch can't do |
| `gh` | GitHub API (webhook delivery history, OAuth app metadata, PR/run inspection); already authenticated as vpittamp |
| `op` | 1Password CLI, the hub secret root since 2026-06: op read 'op://hub-eso/<item>/<field>' reads hub secrets; op://CLI/<id>/credential holds the hub-eso-reader SA token. Replaces az for hub secrets |
| `az` | Azure KeyVault (keyvault-thcmfmoo5oeow); DORMANT for the hub (Azure KV + AD App + OIDC/JWKS not deleted but no longer in the hub recreate path); still relevant for spoke OAuth/cert material. az keyvault secret show --query attributes.updated -o tsv for rotation-time audits |
| `skopeo` | Legacy: mirror images from ryzen's local Gitea PVC to ghcr.io (only needed for unrecovered pre-A6 artifacts). Use --dest-authfile with hub's ghcr-push-credentials secret. Run from ryzen (DNS) |
| `talosctl` | Hub: --talosconfig ~/.talos/hub-config; Talos cluster (Hetzner): ~/.talos/talos-config. Spokes don't have ready-made talosconfig; use kubeconfig fallback |
| `hcloud` | Active context stacks (hcloud context list). hcloud server list for full Hetzner topology |
| `tailscale` | status --json for orphan-tag diagnosis; serve status / funnel status from inside operator pods |
| `git` | Push app/source repos and promoted stacks changes to origin unless a task is explicitly about historical Gitea recovery |
Repo paths cheat-sheet
| Path | Role |
|---|---|
| `packages/components/hub-spoke-appsets/release-pins/workflow-builder-images.yaml` | dev/staging release metadata source; edit with the generated overlays |
| `packages/components/workloads/workflow-builder-system-overlays/{dev,staging}/kustomization.yaml` | generated dry-source overlays consumed by source-hydrator; contains release-pin images and per-spoke runtime env |
| `packages/components/hub-spoke-appsets/apps/spoke-workloads-appset.yaml` | The cluster-selecting ApplicationSet that points source-hydrator at each generated workflow-builder-system overlay |
| `packages/base/manifests/tailscale-ingresses/` | Shared *-CLUSTER-placeholder tailnet exposures for promoted-spoke app hostnames: device-backed Tailscale Ingresses (e.g. phoenix-*) plus the workflow-builder L4 LoadBalancer Service-workflow-builder-tailnet.yaml (PR #2319) |
| `packages/components/workloads/<image>/manifests/kustomization.yaml` | Per-image workloads kustomization; current ryzen delivery uses ghcr.io refs reconciled by ryzen's local ArgoCD (root-ryzen) from main. Exception: workflow-builder + workflow-mcp-server, whose bare images: block was deleted (C1); their ryzen pin lives in the rendered Component below |
| `packages/components/workloads/workflow-builder-ryzen-image/kustomization.yaml` | Render-generated kustomize Component carrying the ryzen workflow-builder + workflow-mcp-server image pin (newName/newTag); components:-included by workflow-builder/manifests/kustomization.yaml. commit-pin.sh RENDERS + COMMITS it LOCALLY (wfb PR #37); stacks CI .github/workflows/render-ryzen-image.yml is a drift-net that re-renders on a diff; do NOT hand-edit |
| `packages/components/hub-spoke-appsets/release-pins/workflow-builder-images-ryzen.yaml` | Flat ryzen pins file (images/imageRefs/digests/sourceShas) that commit-pin.sh upserts for workflow-builder + workflow-mcp-server; commit-pin renders the Component above from it locally in the same push (the CI render-net re-derives it too). The render consumes ONLY the workflow-builder + workflow-mcp-server rows; other services' rows here are inert |
| `.github/workflows/render-ryzen-image.yml` | stacks CI drift-net: re-renders workflow-builder-ryzen-image from the flat ryzen pins file on each touching push (runs render-workflow-builder-release-overlays.sh with WFB_RENDER_ENVS=ryzen), commits only on a diff and NO-OPs when commit-pin's local render already matches |
| `packages/components/workloads/workflow-builder/manifests/Deployment-workflow-builder.yaml` | AGENT_RUNTIME_*_DEFAULT_IMAGE env vars that the BFF reads per-session to select the agent-sandbox Sandbox pod image (separate from release-pins; see gotchas) |
| `packages/components/workloads/activepieces-mcps/manifests/` | All-catalog reconciler that turns workflow-builder DB mcp_connection rows (+ pinned pieces) into cluster-local per-piece ap-<piece>-service piece-runtime KServices (one piece-mcp-server image / PIECE_NAME, serving /execute + /mcp + /options + /health; needs NODE_OPTIONS=--max-old-space-size=400 + 512Mi) and activepieces-mcp-catalog |
| `packages/base/manifests/knative-serving/kustomization.yaml` | Knative Serving install and autoscaler config, including allow-zero-initial-scale needed by generated piece-runtime services |
| `packages/components/workl |