name: agent-workflow-graphs
description: >
  Guide the design of graph-based workflows for AI agents -- branching, chaining, merging, conditions, suspend/resume for human-in-the-loop, streaming updates to users, and observability with tracing. Use when the user needs to build structured multi-step agent processes, add deterministic control flow to LLM-powered systems, implement durable workflows that survive crashes, or add tracing and observability. Also use when the user mentions "workflow", "graph", "branching", "chaining", "suspend", "resume", "tracing", "observability", "OpenTelemetry", "durable execution", or asks how to make their agent follow a specific step sequence.
Agent Workflow Graphs
When agents have too much freedom, they produce unpredictable results. Graph-based workflows constrain the agent to a structured process while still leveraging LLM intelligence at each step.
Core Concept
A workflow graph breaks a complex task into discrete steps connected by edges. Each step can:
- Call an LLM for a focused decision
- Execute deterministic code
- Call external APIs
- Wait for human input
The key insight: the LLM makes several small, focused decisions (often binary) instead of one big, open-ended decision.
Workflow Primitives
Branching (Fan-Out)
Trigger multiple LLM calls on the same input in parallel:
Input --> Step 1 (check symptom A)
      --> Step 2 (check symptom B)
      --> Step 3 (check symptom C)
Use when: You need to check multiple independent things. Running 12 parallel calls that each check one symptom is usually better than running 1 call that checks all 12.
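A minimal sketch of fan-out, assuming each check is an independent async call. `checkSymptom` and `fanOutChecks` are hypothetical names, and the substring match is a placeholder for a real single-symptom LLM call.

```typescript
type SymptomCheck = { symptom: string; present: boolean };

// Hypothetical stand-in for a focused LLM call: one symptom per call.
// A real implementation would send `report` and a single-symptom prompt to a model.
async function checkSymptom(report: string, symptom: string): Promise<SymptomCheck> {
  // Deterministic placeholder so the sketch runs without a model.
  return { symptom, present: report.toLowerCase().includes(symptom.toLowerCase()) };
}

// Fan-out: start every check on the same input at once, then wait for all of them.
async function fanOutChecks(report: string, symptoms: string[]): Promise<SymptomCheck[]> {
  return Promise.all(symptoms.map((symptom) => checkSymptom(report, symptom)));
}
```

`Promise.all` preserves input order, so the result at index `i` belongs to `symptoms[i]`. It also rejects as soon as any one check fails; `Promise.allSettled` may be a better fit if partial results are acceptable.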
Chaining (Sequential)
Feed the output of one step into the next:
Step 1 (fetch data) --> Step 2 (analyze) --> Step 3 (summarize)
Each step waits for the previous step and has access to prior results via a shared context object.
Use when: Steps have dependencies -- each needs the previous step's output.
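One way to sketch chaining with a shared context object, assuming each step writes its output under its own name. `ChainContext`, `ChainStep`, `runChain`, and the three demo steps are illustrative names, not a specific framework's API.

```typescript
// Shared context that accumulates each step's output under the step's name.
type ChainContext = Record<string, unknown>;
type ChainStep = { name: string; run: (ctx: ChainContext) => Promise<unknown> };

async function runChain(steps: ChainStep[]): Promise<ChainContext> {
  const ctx: ChainContext = {};
  for (const step of steps) {
    // Each step starts only after the previous one has finished.
    ctx[step.name] = await step.run(ctx);
  }
  return ctx;
}

// Hypothetical steps mirroring the diagram: fetch data --> analyze --> summarize.
const reportChain: ChainStep[] = [
  { name: "fetchData", run: async () => [3, 5, 8] },
  {
    name: "analyze",
    run: async (ctx) => (ctx["fetchData"] as number[]).reduce((sum, n) => sum + n, 0),
  },
  { name: "summarize", run: async (ctx) => `total=${String(ctx["analyze"])}` },
];
```

The casts on `ctx` are a shortcut for brevity; a production version would likely validate each step's input with a schema instead.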
Merging (Fan-In)
After paths branch, bring their results back together:
Step 1 --\
          --> Merge step (combine results) --> Output
Step 2 --/
Use when: You branched earlier and need to combine independent results into a single output.
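A small fan-in sketch, assuming two independent branches that return the same result shape. `searchDocs`, `searchTickets`, and `mergeFindings` are hypothetical placeholders for real branch and merge steps.

```typescript
type BranchResult = { source: string; findings: string[] };

// Two hypothetical branches that run independently on the same input.
async function searchDocs(query: string): Promise<BranchResult> {
  return { source: "docs", findings: [`doc hit for ${query}`] };
}
async function searchTickets(query: string): Promise<BranchResult> {
  return { source: "tickets", findings: [`ticket hit for ${query}`] };
}

// Merge step: combines every branch's results into a single output.
function mergeFindings(results: BranchResult[]): string[] {
  const merged: string[] = [];
  for (const result of results) {
    for (const finding of result.findings) {
      merged.push(`[${result.source}] ${finding}`);
    }
  }
  return merged;
}

async function searchAndMerge(query: string): Promise<string[]> {
  // The merge step runs only after all branches have finished.
  const results = await Promise.all([searchDocs(query), searchTickets(query)]);
  return mergeFindings(results);
}
```

Keeping the merge step as plain deterministic code makes its input and output easy to inspect; an LLM call could replace it when the branches need to be reconciled rather than concatenated.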
Conditions
Execute steps conditionally based on intermediate results:
Step 1 (fetch data)
|
v
[condition: fetchData.status === "success"]
|
v
Step 2 (process data)
Use when: Workflow paths depend on runtime results (success/failure, data type, user choice).
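The diagram's `fetchData.status === "success"` condition could look roughly like this in code. The step bodies and the `runConditional` wrapper are assumptions made for illustration.

```typescript
type FetchResult = { status: "success" | "error"; rows: number[] };

// Hypothetical step 1; a real one would call an external API.
async function fetchData(shouldFail: boolean): Promise<FetchResult> {
  return shouldFail ? { status: "error", rows: [] } : { status: "success", rows: [1, 2, 3] };
}

// Hypothetical step 2, which should only run on successfully fetched data.
function processData(rows: number[]): number {
  return rows.reduce((sum, n) => sum + n, 0);
}

async function runConditional(shouldFail: boolean): Promise<number | null> {
  const fetched = await fetchData(shouldFail);
  // Edge condition from the diagram: continue only when the fetch succeeded.
  if (fetched.status === "success") {
    return processData(fetched.rows);
  }
  return null; // Condition not met, so step 2 is skipped.
}
```

Because the condition is ordinary code rather than a prompt, the same intermediate result always takes the same path.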
Best Practices for Workflow Steps
- Meaningful I/O at each step: Design inputs and outputs so they make sense in your tracing UI
- One LLM call per step maximum: Each step should do one focused thing
- Combine primitives: Loops, retries, and complex patterns are all compositions of these four primitives
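As an illustration of composing primitives, a retry can be sketched as a chained step plus a condition that loops back to the same step. `retryStep` and its parameters are hypothetical names chosen for this example.

```typescript
// Retry as a composition: run a step (chaining), check its result (condition),
// and loop back to the same step until it passes or attempts run out.
async function retryStep<T>(
  step: (attempt: number) => Promise<T>,
  isAcceptable: (result: T) => boolean,
  maxAttempts: number,
): Promise<{ result: T; attempts: number }> {
  let attempt = 1;
  let result = await step(attempt);
  while (!isAcceptable(result) && attempt < maxAttempts) {
    attempt += 1;
    result = await step(attempt);
  }
  // Note: if every attempt fails the check, the last result is still returned.
  return { result, attempts: attempt };
}
```

Callers should re-check `isAcceptable` on the returned result when they need to distinguish success from exhausted attempts.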
Suspend and Resume
Problem
Workflows sometimes need to pause for external input (human approval, webhook callback, long-running external process).
Solution
Persist the workflow state, then resume from exactly where it left off.
Step 1 --> Step 2 --> [SUSPEND] --> waiting for human approval
                                        |
                         human approves |
                                        v
                      [RESUME] --> Step 3 --> Step 4
Implementation Pattern
- Define `suspendSchema` on the step that needs to pause
- Call `suspend()` with a payload (what you're waiting for)
- The workflow persists its state to a database
- When the external event arrives, call `resume()` with the response data
- The workflow continues from the suspended step
Key Insight
This is the workflow equivalent of HITL (human-in-the-loop). The workflow doesn't keep a running process alive -- it serializes state and picks back up later.
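A framework-agnostic sketch of the serialize-and-resume idea follows. It does not reproduce any library's `suspend()` / `resume()` API: `startRun`, `resumeRun`, `snapshotStore`, and the step shapes are all assumptions, and the `Map` stands in for the database that a durable workflow would need.

```typescript
type Ctx = Record<string, unknown>;
// JSON-serializable snapshot of a paused run, suitable for a database row.
type Snapshot = { runId: string; stepIndex: number; context: Ctx; waitingFor?: unknown };
type StepOutcome = { kind: "done"; output: unknown } | { kind: "suspend"; payload: unknown };
type DurableStep = { name: string; run: (ctx: Ctx, resumeData?: unknown) => StepOutcome };
type RunResult = { status: "suspended" | "completed"; context: Ctx };

// ASSUMPTION: in-memory stand-in for durable storage; production code needs a database.
const snapshotStore = new Map<string, string>();

function execute(steps: DurableStep[], snap: Snapshot, resumeData?: unknown): RunResult {
  const context: Ctx = { ...snap.context };
  for (let i = snap.stepIndex; i < steps.length; i++) {
    const step = steps[i];
    // Resume data is passed only to the step the run was suspended on.
    const outcome = step.run(context, i === snap.stepIndex ? resumeData : undefined);
    if (outcome.kind === "suspend") {
      // Persist state and return; no process stays alive while waiting.
      const paused: Snapshot = { runId: snap.runId, stepIndex: i, context, waitingFor: outcome.payload };
      snapshotStore.set(snap.runId, JSON.stringify(paused));
      return { status: "suspended", context };
    }
    context[step.name] = outcome.output;
  }
  snapshotStore.delete(snap.runId);
  return { status: "completed", context };
}

function startRun(runId: string, steps: DurableStep[]): RunResult {
  return execute(steps, { runId, stepIndex: 0, context: {} });
}

function resumeRun(runId: string, steps: DurableStep[], resumeData: unknown): RunResult {
  const saved = snapshotStore.get(runId);
  if (saved === undefined) throw new Error(`no suspended run ${runId}`);
  // Continue from the suspended step using only the persisted snapshot.
  return execute(steps, JSON.parse(saved) as Snapshot, resumeData);
}

// Hypothetical workflow: draft --> wait for approval --> send.
const refundWorkflow: DurableStep[] = [
  { name: "draft", run: () => ({ kind: "done", output: "refund $20" }) },
  {
    name: "approval",
    run: (ctx, resumeData) =>
      resumeData === undefined
        ? { kind: "suspend", payload: { question: `Approve ${String(ctx["draft"])}?` } }
        : { kind: "done", output: resumeData },
  },
  { name: "send", run: (ctx) => ({ kind: "done", output: `sent after ${JSON.stringify(ctx["approval"])}` }) },
];
```

Calling `startRun("run-1", refundWorkflow)` stops at the approval step and stores a snapshot; a later `resumeRun("run-1", refundWorkflow, { approved: true })` re-enters that step with the response and runs to completion. Round-tripping through `JSON.stringify` is deliberate: it forces the context to stay serializable, which is the property that lets a run survive a process restart.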
Streaming Updates
Why Streaming Matters
A 10-second blank screen feels broken. The same 10 seconds with live progress updates feels fast and responsive.
What to Stream
- LLM tokens: Show text as it's generated
- Workflow step updates: "Searching... Analyzing... Writing..."
- Partial results: Push intermediate outputs before the workflow completes
How to Build Streaming
- Stream as much as you can: Tokens, workflow steps, custom data
- Use reactive tools: ElectricSQL, Turbo Streams, or Server-Sent Events (SSE) for real-time updates
- Escape hatches: If a function is stuck waiting, push partial results to the frontend
Pattern: Streaming from Workflow Steps
Each step can emit progress updates to the client while executing. The client renders updates as they arrive, creating a responsive experience even for multi-minute agent runs.
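A rough sketch of this pattern, assuming each step receives an `emit` callback. The event shapes, `runWithProgress`, and `demoSteps` are hypothetical; `toSseFrame` shows how an event could be framed for SSE, with the HTTP transport left out.

```typescript
type ProgressEvent =
  | { type: "step-start"; step: string }
  | { type: "partial"; step: string; data: string }
  | { type: "step-end"; step: string };

type Emit = (event: ProgressEvent) => void;
type StreamingStep = { name: string; run: (emit: Emit) => Promise<void> };

// Runs steps in order, reporting step boundaries; steps may emit partial results themselves.
async function runWithProgress(steps: StreamingStep[], emit: Emit): Promise<void> {
  for (const step of steps) {
    emit({ type: "step-start", step: step.name });
    await step.run(emit);
    emit({ type: "step-end", step: step.name });
  }
}

// Formats one event as a Server-Sent Events frame: a "data:" line followed by a blank line.
function toSseFrame(event: ProgressEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Hypothetical steps: the first pushes an intermediate result before it finishes.
const demoSteps: StreamingStep[] = [
  {
    name: "search",
    run: async (emit) => {
      emit({ type: "partial", step: "search", data: "3 sources found" });
    },
  },
  { name: "write", run: async () => {} },
];
```

On a server, `emit` would typically write `toSseFrame(event)` to the open response; the client then renders each event as it arrives.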
Observability and Tracing
Why Observability is Critical
LLMs are non-deterministic. The question isn't whether your agent will go off the rails. It's when and how much.
Tracing
A trace is a tree of spans showing the input/output of every function called during an agent run. Think of it like a flame chart or a nested HTML document.
Standard format: OpenTelemetry (OTel) -- use it for portability across vendors.
What a Tracing UI Shows
- Trace view: How long each step took (parse_input, process_request, api_call, etc.)
- Input/output inspection: Exact JSON data flowing in and out of each LLM call
- Call metadata: Status, start/end times, latency, operation type
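To make the span tree concrete, here is a hand-rolled sketch of the data a tracing UI reads. It is not the OpenTelemetry API: `SketchTracer` and the `Span` shape are assumptions, and it only handles synchronous functions.

```typescript
type Span = {
  name: string;
  input: unknown;
  output?: unknown;
  status: "ok" | "error";
  startMs: number;
  endMs?: number;
  children: Span[];
};

class SketchTracer {
  readonly roots: Span[] = [];
  private stack: Span[] = [];

  // Wraps a synchronous call in a span, nested under whichever span is currently open.
  trace<T>(name: string, input: unknown, fn: () => T): T {
    const span: Span = { name, input, status: "ok", startMs: Date.now(), children: [] };
    const parent = this.stack[this.stack.length - 1];
    (parent ? parent.children : this.roots).push(span);
    this.stack.push(span);
    try {
      const output = fn();
      span.output = output;
      return output;
    } catch (err) {
      span.status = "error";
      throw err;
    } finally {
      // Always close the span so latency is recorded even on failure.
      span.endMs = Date.now();
      this.stack.pop();
    }
  }
}

// Usage: one root span with two child spans, named after the steps listed above.
const tracer = new SketchTracer();
const answer = tracer.trace("agent_run", { question: "2+2" }, () => {
  const parsed = tracer.trace("parse_input", "2+2", () => [2, 2]);
  return tracer.trace("process_request", parsed, () => parsed[0] + parsed[1]);
});
```

Serializing `tracer.roots` as JSON gives the nested structure a trace view renders. In a real system, an OpenTelemetry SDK would create and export the spans, and async steps would need context propagation rather than a simple stack.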
Eval Integration
Tracing UIs also show eval results:
- Side-by-side comparison of agent response vs. expected
- Overall score per PR (to catch regressions)
- Score over time, filterable by tags and run date
Best Practices
- Emit traces in OpenTelemetry format for vendor portability
- Use a cloud tool for production and a local tracing tool (like Mastra's dev UI) for development
- Look at production traces regularly -- they reveal failure patterns that tests miss
Decision Framework
| Situation | Pattern |
|---|---|
| Multiple independent checks on same input | Branching (fan-out) |
| Sequential dependent steps | Chaining |
| Combining parallel results | Merging (fan-in) |
| Runtime-dependent paths | Conditions |
| Need human approval mid-workflow | Suspend/resume |
| Users waiting for multi-step results | Streaming updates |
| Debugging production agent failures | Tracing + observability |
Gotchas
- Workflows add complexity. Only use them when agents are too unpredictable without structure.
- Design each step's I/O to be meaningful -- you'll be reading it in traces.
- One LLM call per step. Multi-call steps are harder to debug and trace.
- Suspend/resume requires persistent state storage (database, not memory).
- Streaming isn't optional for production agent UX. Users need to see progress.
- Use OpenTelemetry for tracing. Proprietary formats lock you into a vendor.
For implementation examples, see references/workflow-examples.md.