name: workflow-builder
description: "Author, visualize, and debug SW 1.0 workflows for the workflow-builder app. Use for agent durable/run steps, session goal loops (Codex /goal parity, goal MCP tools, token budgets), Session Pulse vitals (cost/context/cache-hit), Prompt Workbench prompts/presets/previews, trigger schemas, jq expressions, action slugs, workflow or agent MCP connections, ActivePieces piece MCP auth, canvas JSON, DB-backed workflow rows, failed workflow runs, silent agents, prompt caching considerations, Dapr workflow sidecar readiness, pod 1/2 due to daprd, Claude Agent SDK / claude-agent-py runtime routing, SWE-bench sandbox/runtime distinctions, and workflow-orchestrator/openshell-agent-runtime troubleshooting across PittampalliOrg/workflow-builder and PittampalliOrg/stacks."
Workflow Builder
Author SW 1.0 workflows for the workflow-builder app, get them rendered in the canvas, and run them end-to-end. Diagnose runtime failures using the cluster topology of the per-agent runtime model, including MCP connection resolution and Dapr durability state.
Mental model in one paragraph
A workflow is a CNCF Serverless Workflow 1.0 document with workflow-builder extensions, stored in the workflows table as three JSONB columns: spec (the SW 1.0 doc), nodes + edges (SvelteFlow canvas representation derived from the spec). The Python workflow-orchestrator pod parses the spec at execution time and dispatches each task: most go through function-router via Dapr service-invoke, but durable/run agent steps are dispatched as Dapr child workflows (the workflow literal session_workflow) to the selected agent runtime app id. The runtime is resolved through the runtime registry (services/shared/runtime-registry.json, the SSOT consumed by both the Python orchestrator and the TS BFF) — dapr-agent-py, claude-agent-py (Claude Agent SDK; Anthropic-only; now supports MCP), adk-agent-py, or browser-use-agent. Non-browser runtimes run as per-session ephemeral agent-sandbox Sandbox pods (Kueue-admitted, self-reaped on session end) that differ only by container image; browser-use-agent uses a SandboxWarmPool carve-out for Chromium boot latency. A legacy static dapr-agent-py Deployment survives only for the openshell-durable-agent enum + the agent-runtime-pool-coding benchmark pool. There is no longer an AgentRuntime CRD, Kopf agent-runtime-controller, or per-agent wake/idle annotations — that model is retired. A swap-safety gate (src/lib/server/agents/swap-safety.ts) compares an agent's required capabilities (MCP/hooks/permission/provider/durability) against the target runtime's declared capabilities before dispatch. 
Agent MCP tools are configured in the Tools & Integrations surface (a component tree that replaced the old agent-mcp-picker): each config.mcpServers[] entry carries allowedTools, narrowed two-level — the project mcp_connection.metadata.toolSelection ceiling ∩ the per-agent selection (absent = all tools, [] = none) — surfaced as an attach-list + include-all toggle (which replaced the explicit/project/auto select); the ?tools= filter is enforced on both resolvers. OAuth-backed ActivePieces tools pass only X-Connection-External-Id (reference-forwarding); credentials stay in app_connection storage and are decrypted only by workflow-builder's internal API. Agent prompt authoring lives in the Prompt Workbench: it edits the saved AgentConfig prompt fields, applies project-scoped presets from resource_prompts / resource_prompt_versions, renders Mustache variables for authoring preview only, and still publishes through the existing agent save/version flow. The SvelteKit BFF is a UI + proxy layer; everything durable lives in Dapr. Edits to a workflow's spec are picked up at the next execution, not by image rebuild.
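The two-level allowedTools narrowing described above can be sketched in a few lines. This is a minimal illustration, not the resolver's actual code: the function name `effective_tools`, its arguments, and the sample tool names are hypothetical, and it assumes the "absent = all tools, [] = none" convention applies at both the project ceiling and the per-agent level.

```python
from typing import List, Optional


def effective_tools(
    available: List[str],
    project_ceiling: Optional[List[str]],
    agent_selection: Optional[List[str]],
) -> List[str]:
    """Intersect the project ceiling with the per-agent selection.

    None means "absent" (no restriction at that level); an empty list
    means "no tools". The parameters are hypothetical stand-ins for
    mcp_connection.metadata.toolSelection and mcpServers[].allowedTools.
    """
    allowed = set(available)
    for level in (project_ceiling, agent_selection):
        if level is not None:
            allowed &= set(level)
    # Preserve the ordering in which the server listed its tools.
    return [tool for tool in available if tool in allowed]


tools = ["send_email", "list_events", "create_event"]  # made-up tool names
print(effective_tools(tools, ["send_email", "list_events"], None))
print(effective_tools(tools, ["send_email", "list_events"], ["list_events", "create_event"]))
print(effective_tools(tools, None, []))
```

With a ceiling and no agent selection the ceiling wins; with both, only the overlap survives; an explicit empty agent selection yields no tools at all.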
When to use this skill
Trigger on any of: "build a workflow", "author a workflow", "add an agent step", "add a trigger", "make this run on a webhook", "the run failed", "the agent never starts", "the canvas is empty", "${ .trigger.x } isn't resolving", "what slugs are available", "why isn't my sandbox persisting", "why is daprd crashing", "where does my workflow run", "set a goal on this session", "the goal loop stopped continuing", "the goal budget burned instantly", "Session Pulse cost/context looks wrong".
Not this skill — use dapr-agents-workflow instead if the user wants to hand-write upstream dapr/dapr-agents Python in a portable Dapr app: DurableAgent, AgentRunner, @wfr.workflow / WorkflowRuntime / DaprWorkflowContext, call_activity / call_agent, OrchestrationMode multi-agent orchestration, @tool, or MCPClient. THIS skill is about authoring the workflow-builder product — SW 1.0 JSON specs, the canvas, durable/run dispatch, agentConfig, Prompt Workbench, ActivePieces piece MCP, and the orchestrator/sandbox cluster ops — where agent runtimes are opaque, registry-selected dispatch targets (the orchestrator calls ctx.call_child_workflow("session_workflow") for you; you never write a DurableAgent).
Quick decision tree
| The user wants to… | Do this |
|---|---|
| Add an HTTP call | Copy assets/minimal-http.workflow.json. Read references/sw-1.0-spec.md for jq rules. |
| Add an agent step (call Claude/GPT in a sandbox) | Copy assets/minimal-agent.workflow.json. Read references/agent-task.md. |
| Edit an agent persona or manage prompt presets | Use the agent detail Prompt Workbench. Read references/prompt-workbench.md; presets write into unsaved agent config until the normal save/publish flow runs. |
| Debug a workflow agent-node compiled prompt preview | Read references/prompt-workbench.md and references/agent-task.md. The preview shows the canonical Dapr shape: system message, chat_history, appended node prompt. |
| Surface a workflow's output in the UI (instead of burying it in output JSONB) | Attach an artifacts: block to the task. Copy assets/with-artifacts.workflow.json. Read references/workflow-artifacts.md, which covers spec shape, the ${ .data.X } post-task context, and where rows land in the run-detail Overview + Outputs tabs. |
| Take user input at run time | Use the input.schema block from assets/trigger-schema.snippet.json; reference fields as ${ .trigger.<name> }. Read references/authoring-recipe.md § Trigger inputs. |
| Share a sandbox between a coding step and an agent step | Copy assets/workspace-keepalive.workflow.json. Read references/agent-task.md § Sandbox bridging. |
| Attach or debug MCP tools on an agent/workflow | Read references/mcp-connections.md, then references/agent-task.md § MCP servers. |
| Discover what actionType slugs exist | Call GET /api/action-catalog (see references/action-catalog.md); don't guess. |
| Insert a finished workflow into the DB | Run scripts/upsert-workflow.py <file.json>. It POSTs to the BFF (which stamps project_id) and PUTs the spec column. |
| Diagnose a failed run | Read references/troubleshooting.md and triage by symptom (parse error / agent timeout / replay chatter / prompt-too-long / project_id NULL). |
| Stop / terminate / purge a running session or workflow run (or wonder why "Stop" did nothing) | Route through the vetted Lifecycle Controller: POST /api/v1/sessions/[id]/stop or POST /api/workflows/executions/[id]/stop with {mode: interrupt\|terminate\|purge\|reset}. Read references/troubleshooting.md § Stopping a run (Lifecycle Controller) + the SSOT docs/workflow-lifecycle-termination.md. |
| Set a persistent objective on a live session (autonomous goal loop), manage/pause/re-arm it, or debug a goal that stopped continuing | UI Goal card on session detail, GET/POST/PATCH /api/v1/sessions/[id]/goal, or have the agent use the auto-wired goal MCP tools. Read references/goal-loop.md. |
| Session token/cost/context numbers look wrong, or a goal budget burns far too fast on one provider | Read references/goal-loop.md § Usage-event convention — check agent.llm_usage input_tokens vs cache_read_input_tokens for gross/subset semantics. |
| Debug 1/2 pods or daprd not ready | Read references/cluster-topology.md for the runtime model, then use the gitops runbook runbooks/debug-dapr-sidecar-stale-readiness.md for live Kubernetes triage. |
| Run official SWE-bench or Benchmarks UI work | Use the evaluations skill; those paths are intentionally outside normal SW 1.0 workflow authoring. |
| Confirm a freshly-inserted workflow shows up + runs | Read references/verify-in-ui.md. |
| Understand "where does my workflow actually run?" | Read references/cluster-topology.md. |
| Prove a Claude/SWE-bench run used the right runtime | Read references/troubleshooting.md and the evaluations skill. Trust benchmark_runs.agent_runtime, agent_runtime_app_id, workflow output agentRuntime, agentWorkflowMode, trace IDs, and outputSync before container labels. |
| Watch a build/promotion/sync land on ryzen + dev live, or debug the GitOps pipeline view | Open /admin/gitops/system (the event-driven "Kargo lens" pipeline, fed by hub Argo Events → gitops_activity_events → SSE). The header build <sha> badge is the running image on THAT cluster. See the gitops skill § Event-driven activity stream. |
| See an image's build status + the Commit→Build→Pin→Promote→Deploy chain | Same view: stage cards carry a build chip (Built/Building/Failed + duration + Tekton deep-link) and the node drawer has a Delivery timeline (inter-step gaps + durations + a commit→live lead-time, lane-aware Promote). This is inventory-sourced (the hub inventory's per-app build/promotion/live + imageHistory), NOT the Argo-Events stream — see the gitops skill § Event-driven activity stream → Build feedback + delivery timeline. |
| Get notified when a deploy goes live (any page) | App-wide deployment notifications (admin-gated): a toast + a sidebar notification bell fire when a component's LIVE image tag changes on a cluster. INVENTORY-diff (tag-SET diff of live.images), not the event stream; store at src/lib/stores/deployment-notifications.svelte.ts, started in the root layout. See the gitops skill § Event-driven activity stream → App-wide deployment notifications. |
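The two-call insert from the "Insert a finished workflow into the DB" row (POST to create the row, then PUT the spec) can be sketched as a pair of request builders. This is an illustration under stated assumptions, not the bundled script: the input document layout (`name`, `spec`, `nodes`, `edges` keys), the helper names, and the sample id are hypothetical. No network call is made; the functions only describe the requests.

```python
from typing import Any, Dict, Tuple

# (HTTP method, path, JSON body)
Request = Tuple[str, str, Dict[str, Any]]


def build_create_request(doc: Dict[str, Any]) -> Request:
    """Step 1: create the row. The BFF stamps project_id server-side."""
    body = {
        "name": doc["name"],
        "nodes": doc.get("nodes", []),
        "edges": doc.get("edges", []),
    }
    return ("POST", "/api/workflows", body)


def build_spec_request(workflow_id: str, doc: Dict[str, Any]) -> Request:
    """Step 2: write the spec column in a follow-up PUT."""
    return ("PUT", f"/api/workflows/{workflow_id}", {"spec": doc["spec"]})


doc = {
    "name": "demo",
    "spec": {"document": {"dsl": "1.0.0", "name": "demo"}, "do": []},
}
print(build_create_request(doc))
print(build_spec_request("wf_123", doc))  # "wf_123" is a made-up id
```

Keeping the spec out of the first body mirrors the split between the create call and the spec update, so a caller cannot assume a single POST is enough.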
Critical gotchas (memorize these — they cost the most time)
These are the failure modes that look like obscure bugs but are actually doing-it-wrong. Each entry has the why so you can judge edge cases instead of robotically applying the rule.
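As a concrete preview of the first gotcha in this list, the full-string jq rule can be sketched as follows. This is a hypothetical re-implementation for illustration only; the real is_expression_string in sw_expressions.py may differ in details such as whitespace handling.

```python
def is_expression_string(value: object) -> bool:
    """Return True only when the whole string is one ${ ... } expression."""
    return (
        isinstance(value, str)
        and value.startswith("${")
        and value.endswith("}")
    )


candidates = [
    "${ .trigger.url }",              # whole-string expression
    "prefix ${ .trigger.url }",       # literal text, not evaluated
    '${ "prefix " + .trigger.url }',  # concat inside one expression
]
for text in candidates:
    print(is_expression_string(text), text)
```

Only the first and third strings qualify, which is why interpolation has to happen inside a single expression.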
jq is full-string-only. is_expression_string (in services/workflow-orchestrator/core/sw_expressions.py) only evaluates a value if the entire string starts with ${ and ends with }. So "${ .trigger.url }" evaluates; "prefix ${ .trigger.url }" passes through as literal text. To interpolate, concat inside one expression: "${ \"prefix \" + .trigger.url }".
Trigger context is .trigger, not .input. tc.task_outputs["trigger"] = {label, actionType, data: trigger_data}; the orchestrator's expression context exposes the unwrapped data under ${ .trigger.<field> } (see services/workflow-orchestrator/workflows/sw_workflow.py, tc.task_outputs["trigger"]["data"] = tc.trigger_data). ${ .input } resolves to a different thing (per-task input).
Inside an artifacts: block, ${ .data.X } is the just-completed task's payload, not a cross-task ref. _persist_task_artifacts builds a per-task context that strips two envelopes ({label,actionType,data} storage wrapper + {success,data,error} call wrapper) to reach the payload, and exposes it uniformly so the same idiom works for crawl-style nested payloads (.data.markdown) and agent-style flat ones (.data.content; the orchestrator wraps the flat payload as {data: payload} so the canonical idiom holds). For cross-task refs use the full task name: ${ .fetch_each.data.tier }. See references/workflow-artifacts.md § post-task expression context.
Trigger schema has TWO equivalent placements. Either top-level spec.input.schema.document (canonical, preferred) OR spec.document['x-workflow-builder'].input.schema (alternate). The spec→graph adapter normalizes both into the start node's data.taskConfig.input (see src/lib/utils/spec-graph-adapter.ts:79-94). Pick one and stick with it; when in doubt use the canonical placement.
Node IDs equal task names. The key in each do[] entry IS the node ID in the canvas. __start__ and __end__ are the synthetic entry/exit nodes. The adapter uses @serverlessworkflow/sdk::buildGraph() so 99% of the time you should let the spec drive node generation rather than hand-author nodes/edges.
durable/run is a Dapr child workflow, not an HTTP call. It bypasses function-router. The orchestrator yields ctx.call_child_workflow("session_workflow", app_id="<runtime-app-id>") (the dispatched workflow literal is session_workflow per runtime-registry.json dispatchWorkflowName; the distinct sentinel agent_workflow is only the bridge-eligibility token). The runtime is resolved by _resolve_native_agent_runtime in sw_workflow.py, now a thin shim over core/runtime_registry.resolve(). The runtime app id comes from with.agentRef.id → DB agents.runtime_app_id (SWE-bench pool agents use agent-runtime-pool-coding). Missing agentRef/agentSlug falls back to the registry's defaultRuntimeId (dapr-agent-py). The target pod must be in the same namespace (workflow-builder); Dapr workflow sub-orchestration doesn't cross namespaces.
Claude Agent SDK is a peer runtime, not a sandbox template.
services/claude-agent-py runs the Claude SDK path and should report agentRuntime=claude-agent-py / agentWorkflowMode=claude-agent-sdk in workflow outputs. It is Anthropic-only, runs the whole agent loop in one Dapr activity (durabilityGranularity: per-turn, vs dapr-agent-py's per-activity), owns its own sandbox, and now supports MCP: agentConfig.mcpServers is wired into the SDK (capabilities declared in runtime-registry.json). It is not proved or disproved by seeing sandboxTemplate: "dapr-agent"; that template names the OpenShell workspace image used by workspace/profile. For SWE-bench there are two sandboxes: the repo testbed environment (for example swebench-inference-astropy-1.3) and the agent-host/runtime sandbox. The old-looking dapr-agent or dapr-agent-py container label can be a static/legacy label; use DB fields, workflow output, traces, and image/env pins as truth.
Model defaults are current keys, not historical agent names. The Anthropic default is anthropic/claude-opus-4-8 (runtime default claude-opus-4-8 inside services/claude-agent-py), and the OpenAI GPT default is openai/gpt-5.5 when that key is available in src/lib/agents/model-options.ts and the corresponding stacks component/env is present. Older claude-opus-4-7 / gpt-5.4 rows are legacy unless intentionally pinned for comparison.
Stopping a run goes through ONE vetted Lifecycle Controller; don't hand-roll terminate/purge.
stopDurableRun(target, {mode}) (src/lib/server/lifecycle/{index,cascade,resolvers,reaper,ownership}.ts) is the single server-side entry point for stopping any durable run (target.kind ∈ workflowExecution | session | evalRun). Modes: interrupt (cooperative halt of the current turn, keeps the run: no purge/reap/DB-flip), terminate (hard-stop the durable tree, no purge), purge (terminate → confirm terminal → recursive Dapr purge + reap the Sandbox CR + flip DB rows terminal), reset (purge + force-delete state rows even if Dapr never confirmed terminal; the user-reachable "Stop & reset" byte-clean mode, scope-guarded). Cooperative-first: terminate/purge/reset give a short grace (LIFECYCLE_TERMINATE_GRACE_SECONDS, default 5s) so the agent's cancel-key can halt at the next turn/tool boundary before forcing. Every user "stop" routes through it: POST /api/v1/sessions/[id]/stop and POST /api/workflows/executions/[id]/stop (body {mode, reason?, graceMs?}), session control/interrupt, eval/benchmark run cancel. Generalized from, and shared with, the benchmark cancellation cascade. UI: Stop / Stop & Reset on session-detail + workflow-run pages.
Stop is request/confirm (202 "stopping"), NOT a one-shot fail-closed 409. A stop persists a stop_requested_at intent (migration 0071) and returns HTTP 202 "stopping" while the durable tree converges; it only flips DB / reaps once Dapr is confirmed terminal, finalized by the GET …/stop/status poll (→ confirmDurableStop, idempotent) and/or the lifecycle-terminal-reaper CronJob. 200 = confirmed, 202 = stopping, 409 ONLY on a genuine non-request failure or coordinator_owned. The UI shows "Stopping…" and polls to convergence. (This replaced the old one-shot model that hard-409'd and asked you to retry the same call.)
The cross-app durable/run Stop WEDGE is solved BFF-side; call_child_workflow was KEPT. A durable/run step dispatches its agent child via ctx.call_child_workflow("session_workflow", app_id=<per-session agent app-id>), a sub-orchestration on a SEPARATE per-session Dapr task hub, which Dapr's task-hub-bounded recursive terminate can't reach, so on Stop the cascade terminates the child agent fine but the PARENT hangs RUNNING (the "wedge"). Fix (#77, hardened #78/#79): the BFF confirmDurableStop force-finalizes the wedged parent (force-delete its durable state rows, the reset mechanism, + flip DB), treating it as DB-state cleanup since the agent is already stopped. It fires only on positive evidence: after a grace (LIFECYCLE_WEDGE_FINALIZE_GRACE_SECONDS, default 180s) the parent's live currentNodeId is a durable/run node whose child session is DB-terminated (not a booting-sandbox 404, not a later non-agent node). Rejected alternative: replacing call_child_workflow with fire-and-forget + status-poll dispatch was tried (#74/#75) and reverted (#76): per-session Kueue sandboxes aren't Dapr-service-invokable (no <appid>-dapr service; call_child_workflow routes via PLACEMENT not DNS), a start-ready cap broke SWE-bench, and the agent's first turn didn't fire under StartInstance → "Inference stalled". call_child_workflow is the proven dispatch; don't re-attempt fire-and-poll.
Single stop authority: a benchmark/eval INSTANCE is not stoppable on its own. A coordinator-driven benchmark/eval instance (a workflow_executions row or its agent session) 409s coordinator_owned on the generic per-execution/per-session Stop (both routes check ownsBenchmarkOrEvalRun); cancel the owning run instead (POST /api/benchmarks/runs/[id]/cancel / …/evaluations/runs/[id]/cancel, which cascade through stopDurableRun(purge)). The UI hides the generic Stop and links to the run's Cancel.
Delete/Archive is BLOCKED while a run is active: they 409 with "Stop the run first" (the controller's inspectDurableRun reports the run still active). The sessions-list "Archive" row action was relabeled Delete (it always hard-DELETEd). Stop the run (terminate/purge) before deleting.
Dapr workflow termination is still asynchronous under the hood: terminate means "request shutdown", not proof of terminal. The controller handles the poll-to-terminal + per-session app-id fan-out for you (the native Dapr recursive cascade only reaches same-task-hub children; per-session session_workflow children run under per-session sandbox app-ids, so the controller fans out terminate/purge explicitly per app-id). The orchestrator's old terminate_durable_runs_by_parent_execution activity was RETIRED (it only ever fanned out to the legacy claude-code-agent app-id). Don't add bespoke terminate-then-DB-flip code paths; call the controller.
Both runtimes now stop mid-turn.
dapr-agent-py's cooperative cancel-key write/read AGREE for durable/run (the read strips __turn__N / :turn-N from candidate keys), so a mid-turn user.interrupt / session.terminate actually halts (previously a silent no-op for workflow-driven sessions). claude-agent-py reached management PARITY: POST /api/v2/agent-runs/{id}/{terminate,pause,resume} + DELETE purge (via DaprWorkflowClient), cancellation persistence, a between-turn cooperative-cancel check, and TERMINAL_CONTROL_EVENT_TYPES.
agent.llm_usage input_tokens is NET of cache reads: a SYSTEM INVARIANT. Every dapr-agent-py adapter emits input_tokens disjoint from cache_read_input_tokens (OpenAI + Alibaba report gross and are normalized with max(0, gross - cache_read), wfb PR #90). Goal budgets (delta = input + output + cache_creation, cache READS excluded, per codex semantics), Session Pulse cost, and the post-ingest context_* stamp all depend on it. A provider whose budgets/cost burn ~20× too fast on cached loops = a non-normalized adapter; check raw agent.llm_usage for subset semantics. See references/goal-loop.md.
Goal continuations are exactly-once and driver-owned; don't hand-post them. The goal loop injects each continuation as a visible user.message with origin=goal-continuation and deterministic sourceEventId goal-continuation:<sid>:<iter>, gated by an atomic iteration claim on end_turn idles. A manual repost double-drives the turn. Interrupt-stop PAUSES the goal (resume via the Goal card/PATCH); a frozen loop after a BFF outage is recovered by the goal-loop-tick CronJob's lost-idle probe (safe: Dapr buffers the raised event). Re-arm a budget_limited goal by setting a new one (goalId rotates, accounting resets). See references/goal-loop.md.
Session Pulse Context % trusts provider usage, not the local heuristic. The tile prefers the latest context_* fields on agent.llm_usage (context_count_method=provider_usage) over the pre-call local_advisory heuristic on agent.context_usage (which undercounts 20-25%), and INCLUDES cached tokens (window occupancy, matching Claude Code's calculateContextPercentages). Budget accounting deliberately differs (work metric, net of cache). Don't "fix" one to match the other.
isAgentTaskConfig is just call === "durable/run". That's the entire check (see src/lib/types/agent-graph.ts, isAgentTaskConfig at ~L438). The canvas marks the node type: "agent" automatically. Don't worry about a strict TS body shape: both flat (with: {agentRef, prompt, ...}) and nested (with: {body: {agentRef, prompt, ...}, mode, sandboxName, ...}) are accepted at runtime.
File operations are slug-as-action, not workspace/file with an operation field. Valid slugs: workspace/read_file, workspace/write_file, workspace/edit_file, workspace/list_files, workspace/delete_file, workspace/mkdir, workspace/file_stat. Calling workspace/file with operation: "write" returns "workspace-runtime HTTP 400: operation is required and must be one of read_file, write_file, edit_file, list_files, delete_file, mkdir, file_stat"; that error message is the canonical list of valid slugs.
agentRef placeholders fail the resolver if they're a jq string. When you author ${ .trigger.agentRef } in a workflow JSON's durable/run.with.body.agentRef, the BFF's resolveSpecAgentRefs runs at workflow-LOAD time (before the orchestrator evaluates jq). It expects agentRef to be an object literal with id or slug, sees the string, and throws "Task X (durable/run) is missing agentRef. All workflows must be backfilled to named agents before executing." For evals, service.ts solves this with a helper stampAgentRefIntoDurableRunSteps(spec, {id, version}) that walks the spec and replaces the placeholder with the real ref before handing it to the resolver. If you build similar dispatch glue for non-eval flows, mirror the helper; don't try to make the resolver tolerate jq strings.
with.keepAfterRun: true is required to retain a workspace sandbox. The _should_cleanup_workspaces gate in sw_workflow.py (def at ~L261) reads the spec directly (looking for workspace/* steps with with.keepAfterRun=true OR with.body.input.keepAfterRun=true), not just task outputs, because openshell-agent-runtime doesn't echo the flag back. Without this flag, the live-preview proxy returns 404 "Retained sandbox not found" after the run.
Removed slugs raise at dispatch. The orchestrator's _REMOVED_AGENT_ACTION_TYPES set is exactly eight slugs: claude/run, openshell/run, openshell/session-start, openshell-langgraph/run, openshell-langgraph-observable/run, dapr-agent-py/run, dapr-swe/run, durable/plan. Any of these raises "Removed SW 1.0 agent action". Note: mastra/* and agent/* are legacy/unsupported but are NOT in that set: they don't raise the "Removed" error; they fall through to the default route (function-router → function-registry_default{type: activepieces}, which computes a per-piece ap-<piece>-service) and fail there as an unknown piece/action. (The repo's own CLAUDE.md still lists mastra/* / agent/* as "rejected"; the code is authoritative: only the eight above hard-raise.)
workflows.project_id is NOT NULL since migration 0040. Inserts must come through the BFF (which stamps projectId from locals.session.projectId) or stamp it manually via psql. Workflows without project_id can't appear in any workspace.
POST /api/workflows does NOT write the spec column. It writes name, nodes, edges, engineType, userId, projectId only (see src/routes/api/workflows/+server.ts:34-44). To set spec, follow up with PUT /api/workflows/[workflowId] with body.spec. The bundled scripts/upsert-workflow.py does both calls.
Dapr workflow state is shared through workflowstatestore; agent app state is separate. Each workflow-enabled sidecar must see exactly one actorStateStore=true Component. On current dev, that is namespace-wide workflowstatestore (tablePrefix=wfstate_) for parent workflows, per-session agent workflows, timers, reminders, and activity bookkeeping. dapr-agent-py-statestore is namespace-wide too, but actorStateStore=false; it is only the agent application state API store (tablePrefix=agent_py_). Do not create per-agent or per-session actor stores, and do not move durable history into pod-local state.
Project MCP connections are resolved, not copied as secrets.
mcp_connection rows point to server/catalog metadata and optionally connection_external_id; app_connection stores encrypted OAuth credentials. AP credentials use reference-forwarding for BOTH MCP tools AND deterministic workflow activities: the caller forwards only X-Connection-External-Id (function-router writes the credential_access_logs audit, source reference_forwarded), and the piece-runtime self-resolves the plaintext by GETting the BFF /api/internal/connections/<id>/decrypt; the BFF is the SOLE decryptor. Do not put OAuth tokens or decrypted credential JSON into workflow specs, agent markdown, or KService env.
AP action slugs run on the per-piece piece-runtime, not a monolith. An AP action slug (e.g. github/create-issue) dispatches via function-router (function-registry_default{type: activepieces}) to the per-piece ap-<piece>-service (one converged piece-mcp-server image parameterized by PIECE_NAME), where it runs as a deterministic Dapr activity (POST /execute). The SAME service also serves /mcp (agent tools) + /options (canvas dropdowns). fn-activepieces was deleted; the ~47 ap-<piece>-service Knative services are reconciler-provisioned from enabled mcp_connection rows + pinned pieces (all-catalog, so a new piece is automatic, with no manual per-piece add).
ActivePieces piece MCP URLs should not include :3100 through Knative. The container listens on 3100, but workflow/agent configs should target the cluster-local KService URL such as http://ap-microsoft-outlook-service.workflow-builder.svc.cluster.local/mcp. Stale :3100 URLs bypass Knative and make agents look silent.
MCP connection changes may require agent registry sync. Direct workflow runs resolve project MCP at execution, but published/direct agents also carry startup MCP config that the BFF stamps into the per-session Sandbox pod's env as DAPR_AGENT_PY_BOOTSTRAP_MCP_SERVERS_JSON. After changing an agent's MCP settings, re-publish or call /api/agents/<id>/registry/sync, then verify the bootstrap env on the next launched Sandbox and the [mcp-bootstrap] logs.
The include-all toggle attaches ALL project-level MCPs. In the Tools & Integrations surface, leaving an agent's own attach-list empty with the include-all toggle ON expands the project's mcp_connection rows into the agent's bootstrap list. Each piece MCP's KService (ap-<piece>-service) scales to zero by default, so the first launch serially cold-starts every one; a typical 5-piece Microsoft set hangs pod readiness for 2.5+ minutes. Fix: turn include-all OFF (attach only the specific servers the agent needs) for agents that don't actually use project MCPs (smoke-test agents, prompt-cache validation agents). The bootstrap mcpServers then goes to [] and the Sandbox pod boots in <30s. Project-using agents are fine; the explicit attach-list just makes intent precise.
Prompt-cache TTL + cache-key are per-provider. Anthropic uses AgentConfig.cacheTtl: "5m" | "1h" (1h opt-in via the extended-cache-ttl-2025-04-11 beta header, the right call for Dapr durable agents pausing >5min between turns; 1h costs 2× the base for cache writes vs 1.25× at 5m). OpenAI ignores cacheTtl (no API surface) and instead uses prompt_cache_key derived from agent_id:version to pin the cache shard. Both providers emit cross-provider telemetry on agent.llm_usage (prompt_cache_ttl, prompt_cache_eligible, prompt_prefix_chars, etc.) and a greppable [instruction-bundle] mode=... cache_ttl=... [provider=openai] log line. Threshold for cache eligibility is ≥4000 chars on the static prefix, shared between providers. See references/prompt-caching.md for the protocol details.
Direct DeepSeek is its own provider, not Together DeepSeek.
deepseek/deepseek-v4-pro and deepseek/deepseek-v4-flash map to llm-deepseek-v4-pro / llm-deepseek-v4-flash and are handled by services/dapr-agent-py/src/deepseek_adapter.py against DeepSeek's OpenAI-compatible /chat/completions endpoint using DEEPSEEK_API_KEY. Normal chat enables thinking (DEEPSEEK_REASONING_EFFORT=max by default); tool chat and structured summary calls disable thinking, and structured output uses response_format: {"type":"json_object"} rather than json_schema.
Prompt Workbench is an authoring and preview surface, not a runtime templating engine. In V1, Mustache variables in agent persona fields, presets, or node prompt previews render only in the UI preview. Runtime still uses the canonical Dapr prompt path: one compiled system message from the instruction bundle, chat_history, and the current user prompt appended by Dapr.
Do not parameterize the stable system prompt unless the cache benefit is worth losing. The 9 LLM adapters monkeypatch DaprChatClient.generate to call each provider's HTTP API directly; they deliberately bypass the still-ALPHA Dapr Conversation API, so prompt caching is the provider's own (Anthropic cache_control / cacheTtl, OpenAI prompt_cache_key), NOT a Conversation-component feature. Keep volatile values such as cwd, run id, session id, sandbox name, and workflow input out of the system/preset prefix when possible; put per-run data in the appended user prompt or workflow input so prompt-cacheable prefixes stay stable.
Prompt presets are project-scoped, versioned, and latest-by-default.
resource_prompts is the parent row; template edits create resource_prompt_versions rows with messages, arguments, template_format, and template_hash. Disabled or archived presets should not appear in normal picker results.
Unresolved Mustache variables must stay visible in preview. A missing {{runtime.cwd}} / {{args.foo}} value should produce a warning and preserve the placeholder, not silently blank the prompt.
"Ignoring unexpected taskCompleted event" is normal replay chatter, NOT stuck. durabletask-worker emits this during every call_child_workflow replay cycle. Real "stuck" signals: the target session Sandbox pod never got Kueue-admitted / never reached Running, OR the orchestrator emitting the same "Orchestrator yielded with N task(s) and 0 event(s) outstanding" line for >5 min with placement flaps in target daprd logs. Check the Sandbox pod status + sessions.updated_at before assuming a hang.
Workflow-builder dev runs via Skaffold file sync. Don't start pnpm dev or spin up local containers. The going-forward dev loop is Skaffold: pnpm dev:skaffold for inner-loop HMR (file sync into a Skaffold-owned dev pod), pnpm dev:skaffold:all for the full 6-service set, pnpm deploy:skaffold for outer-loop GitOps commit-pin. See the skaffold-dev-loop skill. For fn-system (Knative), treat the cluster's Argo-managed pod as a stable dependency; Skaffold sync into a transient Knative pod is impractical. The same caveat applies to both: pod env vars are baked at start time, so when the BFF Deployment manifest changes (e.g. AGENT_RUNTIME_DEFAULT_IMAGE, SANDBOX_TEMPLATE_IMAGES_JSON), ArgoCD updates the standard ReplicaSet but a long-lived sync'd pod keeps the old env until Skaffold rebuilds.
Sandbox templates are looked up by name in SANDBOX_TEMPLATE_IMAGES_JSON. A workflow's workspace_profile step takes with.sandboxTemplate: <name>; the BFF resolves that name against the SANDBOX_TEMPLATE_IMAGES_JSON env var on the workflow-builder Deployment to get the actual image. Currently registered: dapr-agent, default-sandbox, dapr-agent-xlsx, xlsx, code-eval. Adding a new template = add the env var entry in stacks, build the matching services/openshell-sandbox/environments/Dockerfile.<name> (commit subject environment(<name>): triggers the env-image-build pipeline), and commit the resulting image ref. Hub Source Hydrator promotes through main → env/hub for hub and env/spokes-* for spokes. See the gitops skill for the GitOps cadence.
Seed/run under the real user unless intentionally testing another account. Current dev/ryzen workflow and SWE-bench seed paths should use vinod@pittampalli.com, not admin@example.com or dev-admin-user. Dev project is N1nbCo9zESa-S0UrzVrOw; ryzen project is AgbRSkJ_pVxerT_WOoZwF. Use SEED_SWEBENCH_FIXTURES_ROLLBACK=true for a dry run, then seed with SEED_WORKFLOW_USER_EMAIL=vinod@pittampalli.com and the spoke project id.
Eval-style workflows use taskConfig.workflowId to load specs from DB. When an evaluation has taskConfig.workflowId set, startEvaluationRunItemWorkflow loads that workflow row from the workflows table and stamps agentRef into trigger.input instead of generating a spec in TypeScript. This is the path used by HumanEval+/MBPP+/BigCodeBench (the code-eval-item workflow). The workflow's durable/run step references ${ .trigger.agentRef } so the BFF-supplied agent substitutes at dispatch time. Operators edit the JSON + re-run scripts/upsert-<workflow>-workflow.mjs to roll prompt/maxTurns changes without a BFF redeploy.
Dapr-agent custom activities use scoped names only. With Dapr Agents 1.0.3, repo-owned services/dapr-agent-py custom workflow activities are registered and called through self._activity_name(...). Do not restore the old bare-name fallback or register both names "for compatibility"; stale durable histories should be cancelled/cleaned/purged instead of keeping an ambiguous activity namespace. (This is a do-not-regress rule for this repo's runtime service. To author activities/agents in a fresh upstream Dapr app, with generic @wfr.activity(name=...) / DurableAgent, use the dapr-agents-workflow skill.)
Seed-script
${JSON.stringify(spec)}::jsonbcan double-encode. The standard upsert pattern inscripts/upsert-*.mjsuses postgres-js template literals with an explicit JSON.stringify + jsonb cast. Under some conditions (notably when running throughnode --input-type=module -e) the cast produces a JSONB string (a quote-wrapped JSON-text scalar) instead of a JSONB object. Symptom:jsonb_typeof(spec) = 'string'andspec->'do'returns null. Fix: re-upsert withsql.json(workflow.spec)— postgres-js handles serialization correctly without the explicit::jsonbcast.Orchestrator
wfstate_stateorphan reminders can block new StartInstance calls. TheworkflowstatestoreComponent isstate.postgresql v2withtablePrefix=wfstate_. When a workflow is purged but its actor reminder is still in dapr-scheduler-server's ETCD, daprd retries the reminder every ~10s and logsUnable to get data on the instance: <id>, no such instance exists. The retry loop can serialize behind the workflow runtime's worker queue and make newctx.call_child_workflow/StartInstancecalls hitDEADLINE_EXCEEDEDafter 60s. Confirm the daprd log pattern first. For terminal workflow cleanup, prefer the Lifecycle Controller (stopDurableRunwithmode:"purge"/"reset"— recursive purge forwardsforce, Dapr 1.17.9 cleans the associated reminders) or let the lifecycle-terminal-reaper CronJob reconcile it; use manualwfstate_statetruncation only as incident recovery after active runs and leases are zero.Dapr sidecar
1/2can be stale after control-plane churn. If the app container is ready butdaprdis not, probehttp://127.0.0.1:3501/v1.0/healthzfrom inside the pod and readkubectl logs <pod> -c daprd. A stale workflow-enabled sidecar can returnERR_HEALTH_NOT_READYforgrpc-api-server/grpc-internal-serverafter placement or scheduler restarts, while3500/v1.0/metadatastill responds. If logs showActor runtime shutting downorWorkflow engine stopped, recycle the affected Deployment after confirming the Dapr control plane is healthy. See gitopsrunbooks/debug-dapr-sidecar-stale-readiness.md.Don't roll workflow-orchestrator images while a workflow is mid-run. Dapr durable-task replay compares the in-process code's
call_activityordering to the persisted history. If a new image lands on the worker pod between yields, replay fails withSub-orchestration task #N failed: A previous execution called call_activity with ID=M, but the current execution doesn't have this action with this ID. The run is dead even if no orchestrator code actually changed (Python module import order, dep updates, or activity-registration reordering can shift IDs). Wait for active runs to finish before pushing image bumps. Hit twice on 2026-04-30 — both failed runs had this error and a fresh run on the stable image worked end-to-end.
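The Mustache preview rule in the list above (warn on a missing value and keep the placeholder visible) can be illustrated with a minimal sketch. This is not the Prompt Workbench renderer; `render_preview`, `lookup`, and the simplified placeholder grammar (dotted names only, no sections or partials) are hypothetical names and assumptions made for illustration.

```python
import re

# Matches {{ dotted.name }} placeholders. This is a simplification of
# Mustache: no sections, partials, or triple-brace tags (assumption).
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(variables, dotted_name):
    """Walk a nested dict by dotted path; return (found, value)."""
    current = variables
    for part in dotted_name.split("."):
        if not isinstance(current, dict) or part not in current:
            return False, None
        current = current[part]
    return True, current


def render_preview(template, variables):
    """Substitute known variables; keep unknown placeholders and warn."""
    warnings = []

    def replace(match):
        name = match.group(1)
        found, value = lookup(variables, name)
        if not found:
            warnings.append(f"unresolved variable: {name}")
            return match.group(0)  # preserve the placeholder verbatim
        return str(value)

    return PLACEHOLDER.sub(replace, template), warnings


# Hypothetical preview: runtime.cwd is known, args.foo is not.
text, warnings = render_preview(
    "Work in {{runtime.cwd}} on {{args.foo}}",
    {"runtime": {"cwd": "/workspace"}},
)
print(text)
print(warnings)
```

The design point is that an unresolved variable never turns into an empty string: the author sees both the untouched `{{args.foo}}` and a warning naming it.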
Reference index
Load these on demand based on what you're doing.
| Task | File |
|---|---|
| Authoring a spec from scratch | references/sw-1.0-spec.md (12 task types, jq rules, validation checklist) + references/authoring-recipe.md (end-to-end) |
| Adding an agent step | references/agent-task.md (durable/run body) + references/cluster-topology.md (per-agent pods) |
| Persisting typed outputs (markdown / JSON / table / link / image) for run-detail rendering | references/workflow-artifacts.md (declarative artifacts: block + UI surfaces) |
| Editing agent prompts, presets, preview variables, or prompt-cache-sensitive content | references/prompt-workbench.md |
| Reading per-provider prompt-cache telemetry, picking a cacheTtl, debugging cache hit rates | references/prompt-caching.md (Anthropic 5m/1h, OpenAI auto + prompt_cache_key, agent.llm_usage field map) |
| Adding or debugging MCP tools/connections | references/mcp-connections.md (modes, project rows, ActivePieces auth, bootstrap checks) |
| Setting/managing session goals, debugging the goal loop or budgets, reading Session Pulse, or triaging per-provider usage accounting | references/goal-loop.md (driver mechanics, MCP completion contract, guardrails, tick reaper, net-of-cache invariant, eval scenarios) |
| Choosing the right action slug | references/action-catalog.md (routing table + catalog API) |
| Debugging pod placement, Dapr sidecars, or runtime topology | references/cluster-topology.md |
| Inspecting/editing the canvas JSON | references/canvas-shape.md (node + edge shapes) |
| Confirming a workflow renders + runs | references/verify-in-ui.md |
| Debugging a failed run | references/troubleshooting.md (symptom-keyed triage) |
| Stopping / terminating / purging a run, or recovering stuck durable/DB state | references/troubleshooting.md § Stopping a run (Lifecycle Controller) + § Stuck durable / DB state (now automated) + the SSOT docs/workflow-lifecycle-termination.md |
Each reference file is focused (60–250 lines) and starts with a short scope summary. Read only what's relevant.
Templates (assets/)
| File | Use when |
|---|---|
| `assets/minimal-http.workflow.json` | One `system/http-request` step. Trigger has one `url` property. Demonstrates jq full-string interpolation + `output.as`. |
| `assets/minimal-agent.workflow.json` | One `durable/run` step. Demonstrates `agentRef`, `prompt` with jq concat, `mode`, `maxTurns`, `stopCondition`. Includes the matching 3-node `nodes`/`edges` payload. |
| `assets/workspace-keepalive.workflow.json` | `workspace/profile` (`keepAfterRun: true`) → `durable/run` reading `${ .workspace_profile.sandboxName }`. The sandbox-bridging pattern. |
| `assets/with-artifacts.workflow.json` | One `durable/run` step with an `artifacts:` block (`kind: markdown`, `slot: primary`). Renders the agent's response on the run-detail Overview tab front-and-centre. Demonstrates the post-task `${ .data.content }` access pattern. |
| `assets/trigger-schema.snippet.json` | Drop-in `input.schema.document` block with form-friendly JSON Schema patterns (`uri`, `enum`, defaults, required). |
Open the file in assets/ first to see the exact shape before drafting your own. Edit a copy — don't modify the templates in-place.
Scripts (scripts/)
`scripts/upsert-workflow.py <file.json>`: POSTs `{name, nodes, edges, engineType}` to BFF `/api/workflows`, then PUTs the spec to `/api/workflows/[id]`. Resolves `project_id` automatically from the user's session (or `WORKFLOW_BUILDER_API_KEY` env). Falls back to a `psql` upsert when the BFF is unreachable. Prints the workflow `id` and a `canvasUrlHint` (the canvas lives at `/workspaces/<slug>/workflows/<id>`; the POST response carries no workspace slug, so substitute the workspace you launched from). Use this instead of curl-ing the API by hand, since every author needs the same boilerplate.
CLIs assumed available
| Tool | Typical use |
|---|---|
| `kubectl` | `kubectl get sandbox -n workflow-builder -w` to watch a per-session Sandbox pod get Kueue-admitted and start; `kubectl logs deploy/workflow-orchestrator -n workflow-builder` for parse errors |
| `psql` | Direct DB writes when the BFF isn't reachable; `SELECT id, name, project_id FROM workflows ORDER BY updated_at DESC LIMIT 5;` |
| `gh` | API spec diffs, GitHub Actions trigger context for webhook-triggered workflows |
| `dapr` | `dapr workflow get -i <instance_id> --app-id workflow-orchestrator` to inspect a stuck run |
| `scripts/upsert-workflow.py` | Insert/update a workflow by JSON file |
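When writing rows directly (the `psql` path) or through a seed script, it can help to check the serialized spec before it reaches the database. The sketch below reproduces the double-encoding symptom described earlier, where `jsonb_typeof(spec)` reports `'string'`, using only Python's `json` module. `jsonb_kind` and `unwrap_double_encoded` are hypothetical helper names, and the one-key spec is a stand-in rather than a valid workflow. This is a local diagnostic only; the remedy stated in this skill remains re-upserting with `sql.json(workflow.spec)`.

```python
import json


def jsonb_kind(text):
    """Name the top-level JSON type of a JSON text, in jsonb_typeof terms."""
    value = json.loads(text)
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bool):
        return "boolean"
    if value is None:
        return "null"
    return "number"


def unwrap_double_encoded(text):
    """Parse a JSON text; if it is a string that itself holds JSON, parse again."""
    value = json.loads(text)
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value  # a genuine string scalar, not double-encoded JSON
    return value


# Stand-in spec (assumption): just enough structure to show the symptom.
spec = {"do": []}
encoded_once = json.dumps(spec)           # what the column should receive
encoded_twice = json.dumps(encoded_once)  # the double-encoded failure mode

print(jsonb_kind(encoded_once))   # object
print(jsonb_kind(encoded_twice))  # string
```

A spec that reports `string` here would also make `spec->'do'` return null in Postgres, because the top-level value is a scalar rather than an object.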
Safety guards before you act
- Don't direct-patch the `function-registry` ConfigMap on the cluster. Slug routing changes go through GitOps (PittampalliOrg/stacks). Read the `gitops` skill for the promotion flow.
- Don't `pnpm dev` in the workflow-builder repo. The canonical dev loop is Skaffold file sync into a Skaffold-owned dev pod (`pnpm dev:skaffold`; see `skaffold-dev-loop`); fn-system uses the Argo-managed pod (Skaffold sync into Knative is impractical).
- Don't run a workflow with `"prefix ${ .trigger.x }"` style expressions; they'll silently pass through as literal text. Fix the jq to a single full-string expression first.
- Don't insert workflows directly into psql without `project_id`. Migration 0040 made the column NOT NULL; without it the workflow can't appear in any workspace.
- Don't create per-agent Dapr state stores. Workflow runtimes use the centralized `workflowstatestore`; agent application state uses centralized `dapr-agent-py-statestore`. Per-agent/per-session stores make component visibility and durable replay harder to reason about.
- Don't store MCP OAuth credentials in workflow JSON or agent markdown. Bind project MCP rows to `app_connection.external_id` and let runtime requests carry `X-Connection-External-Id`.
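The full-string expression guard in the list above can be checked mechanically before a spec is upserted. The following is a heuristic sketch, not the orchestrator's own validation: `is_full_expression` and `find_partial_expressions` are hypothetical names, the rule that a valid value starts with `${` and ends with `}` with no second `${` inside is an assumption, and the task bodies in the sample spec are invented for illustration.

```python
def is_full_expression(value):
    """Heuristic: the whole string is exactly one ${ ... } wrapper."""
    return (
        value.startswith("${")
        and value.endswith("}")
        and "${" not in value[2:]
    )


def find_partial_expressions(node, path="$"):
    """Yield (path, value) for strings that embed ${ ... } inside other text."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_partial_expressions(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_partial_expressions(value, f"{path}[{index}]")
    elif isinstance(node, str):
        if "${" in node and not is_full_expression(node):
            yield path, node


# Hypothetical spec fragment: the first value is a full-string expression,
# the second mixes literal text with an expression and would pass through
# as literal text.
spec = {
    "do": [
        {"fetch": {"with": {"url": "${ .trigger.url }"}}},
        {"notify": {"with": {"message": "prefix ${ .trigger.x }"}}},
    ]
}

problems = list(find_partial_expressions(spec))
for location, value in problems:
    print(f"{location}: {value!r}")
```

A flagged value is usually repaired by moving the literal text inside the expression, for example with jq string concatenation such as `${ "prefix " + .trigger.x }`.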
Authoritative source files (in the repos, not in this skill)
When you need ground truth, read these:
- workflow-builder lifecycle (stop/terminate/purge): `docs/workflow-lifecycle-termination.md` (SSOT), `src/lib/server/lifecycle/{index,cascade,resolvers,reaper}.ts`, `src/routes/api/v1/sessions/[id]/stop/+server.ts`, `src/routes/api/workflows/executions/[executionId]/stop/+server.ts`, `src/routes/api/internal/lifecycle/reap-terminal/+server.ts`
- workflow-builder: `CLAUDE.md`, `docs/workflow-artifacts.md`, `services/workflow-orchestrator/core/sw_types.py`, `services/workflow-orchestrator/core/sw_expressions.py`, `services/workflow-orchestrator/workflows/sw_workflow.py`, `services/workflow-orchestrator/activities/resolve_mcp_config.py`, `services/workflow-orchestrator/activities/persist_artifact.py`, `src/lib/utils/spec-graph-adapter.ts`, `src/lib/types/agent-graph.ts`, `src/lib/types/agents.ts`, `src/lib/agents/model-options.ts`, `src/lib/agents/prompt-workbench-renderer.ts`, `src/lib/components/agents/prompt-workbench.svelte`, `src/lib/components/agents/prompt-preview.svelte`, `src/lib/components/workflow/execution/artifact-renderer.svelte`, `src/lib/components/workflow/execution/artifact-list.svelte`, `src/lib/server/agents/instruction-bundle.ts`, `src/lib/server/agents/mcp-resolution.ts`, `src/lib/server/mcp-connections.ts`, `src/lib/server/mcp-catalog.ts`, `src/lib/server/workflow-artifacts.ts`, `src/routes/api/workflows/+server.ts`, `src/routes/api/workflows/executions/[executionId]/artifacts/+server.ts`, `src/routes/api/internal/workflows/executions/[executionId]/artifacts/+server.ts`, `src/routes/api/prompt-presets/`, `src/routes/api/mcp-connections/`, `src/lib/server/action-catalog/index.ts`, `services/dapr-agent-py/src/deepseek_adapter.py`, `services/claude-agent-py/src/claude_sdk_runner.py`, `services/fn-system/src/steps/dapr-converse-structured-output.ts`, `services/piece-mcp-server/src/auth-resolver.ts`, `drizzle/0060_resource_prompt_versions.sql`, `drizzle/0067_workflow_artifacts.sql`, `atlas/migrations/20260501090000_add_resource_prompt_versions.sql`, `scripts/fixtures/sample-workflows.json`
- workflow-builder runtime SSOT: `services/shared/runtime-registry.json` (canonical), `services/workflow-orchestrator/core/runtime_registry.py`, `src/lib/server/agents/runtime-registry.ts`, `src/lib/server/agents/swap-safety.ts`, `src/lib/server/sessions/spawn.ts`, `scripts/sync-runtime-registry.mjs`
- workflow-builder goal loop + Pulse: `src/lib/server/goals/{goal-loop,repo,render}.ts` + `templates/{continuation,budget_limit}.md`, `src/routes/api/v1/sessions/[id]/goal/+server.ts`, `src/routes/api/internal/goal-loop/tick/+server.ts`, `services/workflow-mcp-server/src/{goal-tools,goal-db,goal-context}.ts`, `src/lib/components/sessions/{session-goal-badge,session-pulse}.svelte`, `src/lib/server/pricing/model-pricing.ts`, `services/dapr-agent-py/src/{openai_adapter,alibaba_adapter,event_publisher}.py`, `drizzle/0079_thread_goals.sql`; stacks: `workflow-builder/manifests/CronJob-goal-loop-tick.yaml`, `{Deployment,Service}-workflow-mcp-server.yaml`
- stacks: `packages/components/workloads/workflow-builder/manifests/Component-dapr-agent-py-statestore.yaml`, `packages/components/workloads/workflow-builder/manifests/Component-workflowstatestore.yaml`, `packages/components/workloads/activepieces-mcps/manifests/`, `packages/base/manifests/knative-serving/kustomization.yaml`, `packages/components/workloads/function-router/manifests/ConfigMap-function-registry.yaml`, the upstream `kubernetes-sigs/agent-sandbox` + Kueue CRDs/manifests, `packages/base/manifests/openshell/MutatingWebhookConfiguration-openshell-sandbox-dapr-webhook.yaml`
The skill summarizes — these are authoritative if anything looks contradictory.