name: choosing-swarm-patterns
description: Use when coordinating multiple AI agents with Agent Relay's workflow engine and you need to pick the right orchestration pattern - covers the 10 core patterns (fan-out, pipeline, hub-spoke, consensus, mesh, handoff, cascade, dag, debate, hierarchical) plus 14 specialized ones, with a decision framework and accurate workflow/YAML examples.

Choosing Swarm Patterns

Overview

The Agent Relay workflow engine (@relayflows/core) supports 24 swarm patterns via a single swarm.pattern field. Patterns are configured declaratively in YAML or programmatically via the workflow() fluent builder; there are no standalone fanOut(...) / hubAndSpoke(...) helpers. Pick the simplest pattern that solves the problem, and add complexity only when that pattern proves insufficient.

Two ways to run a pattern

1. YAML (portable):

import { runWorkflow } from '@relayflows/core';

const run = await runWorkflow('workflows/feature-dev.yaml', {
  vars: { task: 'Add OAuth login' },
});

2. Fluent builder (programmatic):

import { workflow } from '@relayflows/core';

const run = await workflow('feature-dev')
  .pattern('hub-spoke')
  .channel('swarm-feature-dev')
  .agent('lead', { cli: 'claude', role: 'lead' })
  .agent('developer', { cli: 'codex', role: 'worker', interactive: false })
  .step('plan', { agent: 'lead', task: 'Plan {{task}}' })
  .step('implement', { agent: 'developer', task: 'Implement: {{steps.plan.output}}', dependsOn: ['plan'] })
  .run();

Both paths hit the same WorkflowRunner.

Quick Decision Framework

Is the task independent per agent?
  YES → fan-out (parallel workers, hub collects)

Does each step need the previous step's output?
  YES → Is it strictly linear?
    YES → pipeline
    NO  → dag (parallel where possible, `dependsOn` edges)

Does a coordinator need to stay alive and adapt?
  YES → hub-spoke (single-level hub + workers)
        hierarchical (structurally identical in current impl; use for naming/intent)

Is the task about making a decision?
  YES → Do agents need to argue opposing sides?
    YES → debate (adversarial, full mesh)
    NO  → consensus (cooperative, full mesh + coordination.consensusStrategy)

Does the right specialist emerge during processing?
  YES → handoff (sequential chain, one active at a time)

Do all agents need to freely collaborate?
  YES → mesh (full peer-to-peer edges)

Is cost the primary concern?
  YES → cascade (chain of increasingly capable agents; each step's prompt
        decides whether to pass through or redo the prior output)
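The decision tree above can be read as a first-match-wins function. The sketch below is illustrative only: choosePattern and TaskTraits are hypothetical names, not part of @relayflows/core, and the top-to-bottom question order is an assumption taken from the layout of the framework.

```typescript
// Hypothetical helper (not a @relayflows/core API): encodes the decision
// framework as a first-match-wins function over yes/no answers.
type CorePattern =
  | 'fan-out' | 'pipeline' | 'dag' | 'hub-spoke' | 'debate'
  | 'consensus' | 'handoff' | 'mesh' | 'cascade';

interface TaskTraits {
  independentPerAgent?: boolean;   // Is the task independent per agent?
  needsPreviousOutput?: boolean;   // Does each step need the previous output?
  strictlyLinear?: boolean;        // ...and is the flow strictly linear?
  needsLiveCoordinator?: boolean;  // Must a coordinator stay alive and adapt?
  isDecision?: boolean;            // Is the task about making a decision?
  opposingSides?: boolean;         // ...with agents arguing opposing sides?
  specialistEmerges?: boolean;     // Does the right specialist emerge mid-run?
  freeCollaboration?: boolean;     // Do all agents need to collaborate freely?
  costFirst?: boolean;             // Is cost the primary concern?
}

function choosePattern(t: TaskTraits): CorePattern | undefined {
  if (t.independentPerAgent) return 'fan-out';
  if (t.needsPreviousOutput) return t.strictlyLinear ? 'pipeline' : 'dag';
  if (t.needsLiveCoordinator) return 'hub-spoke';
  if (t.isDecision) return t.opposingSides ? 'debate' : 'consensus';
  if (t.specialistEmerges) return 'handoff';
  if (t.freeCollaboration) return 'mesh';
  if (t.costFirst) return 'cascade';
  return undefined; // no question matched; the framework gives no default
}
```

The returned string is what you would then set as swarm.pattern or pass to .pattern(...).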

Pattern Reference (Core 10)

#	Pattern	Topology (actual edges)	Best For
1	fan-out	Hub broadcasts to N workers; workers reply to hub only	Independent subtasks (reviews, research, tests)
2	pipeline	Linear chain (agent[i] → agent[i+1])	Ordered stages (design → implement → test)
3	hub-spoke	Hub ↔ spokes (bidirectional); no spoke-to-spoke	Dynamic coordination, lead reviews/adjusts
4	consensus	Full mesh; decision via `coordination.consensusStrategy`	Architecture decisions, approval gates
5	mesh	Full mesh (every agent ↔ every other)	Brainstorming, collaborative debugging
6	handoff	Chain; passes control forward	Triage, specialist routing
7	cascade	Chain of `dependsOn` steps; all run on success, downstream skipped on upstream failure (no built-in "fall through")	Cost optimization: cheap first, each step's prompt passes through or redoes
8	dag	Edges from step `dependsOn`	Mixed dependencies, parallel where possible
9	debate	Full mesh (same topology as mesh; roles drive behavior)	Rigorous adversarial examination
10	hierarchical	Hub + subordinates (single-level in current impl)	Large teams; semantic distinction from hub-spoke

Heads up: hierarchical resolves to the same edge structure as hub-spoke in coordinator.ts:313-319. Multi-level tree topology is not currently implemented, so use the pattern name to signal intent but expect the same runtime graph.

Additional Patterns (role-driven)

These 14 additional patterns exist in SwarmPattern (types.ts:114-139). The coordinator has role-based auto-selection heuristics (coordinator.ts:51-165), but they only fire when swarm.pattern is omitted — YAML validation requires it (runner.ts:2105-2117), so auto-selection is effectively a programmatic-API feature. In YAML, set swarm.pattern explicitly.

Topology is still resolved per-pattern once selected; the "Roles the topology keys off" column lists the roles the coordinator looks for when shaping edges (per coordinator.ts:250-450):

Pattern	Roles the topology keys off	Topology
`map-reduce`	`mapper` + `reducer`	coordinator → mappers → reducers → coordinator
`scatter-gather`	—	hub → workers → hub
`supervisor`	`supervisor`	supervisor ↔ workers
`reflection`	`critic` or `reviewer` (auto-select uses `critic` only)	producers → critic → producers (loop)
`red-team`	`attacker`/`red-team` + `defender`/`blue-team`	adversarial mesh with optional judges
`verifier`	`verifier`	producers → verifiers → back to producers
`auction`	`auctioneer`	auctioneer → bidders → auctioneer
`escalation`	`tier-*`	tiered chain, escalate up / report down
`saga`	`saga-orchestrator`, `compensate-handler`	orchestrator ↔ participants
`circuit-breaker`	`primary` + `fallback`/`backup`	try primary, fallback on failure
`blackboard`	`blackboard` / `shared-workspace`	shared state hub
`swarm`	`hive-mind` / `swarm-agent`	stigmergy-style
`competitive`	— (declared explicitly)	independent parallel implementations + judge
`review-loop`	`implement` + 2+ `reviewer`	implementer ↔ reviewers

Structured Squad Review Loop

For serious implementation work, especially workflow generation or product-contract changes, prefer a composite squad-review-loop recipe over a plain single implementer plus final reviewer. This is a workflow authoring recipe built from existing patterns, not a separate SDK enum unless the local runner has added one.

Use this when the fastest reliable path is small teams of 2-3 agents working in parallel with live feedback:

Split the work into bounded implementation squads. Each squad owns a non-overlapping file or subsystem scope.
Give each squad an implementer plus a shadow/review partner. The shadow follows the implementer in real time, checks alignment with the spec, and posts concise feedback before the work drifts.
Require the implementer to self-reflect before external review: compare the final diff against the spec, AGENTS.md / CLAUDE.md, recent local conventions, tests, and declared non-goals.
Run an independent self-review/fresh-eyes agent that reads the actual files and recent repo context, not just the chat transcript.
Send that review back to the implementer for one repair round.
After squads converge, run a final two-agent review team, usually one Claude reviewer and one Codex reviewer, independently. They compare notes, merge findings, and produce one final verdict.
Spawn fresh fix agents for final-review findings. Those fix agents self-reflect, then the final reviewers re-check the post-fix state until the spec is fully satisfied or a blocker is documented.

Pattern selection for this recipe:

Use supervisor or hub-spoke when a lead needs to coordinate live squads.
Use review-loop when the main risk is code quality and feedback iteration.
Use reflection when critic feedback should loop directly back to producers.
Use verifier when completion evidence matters more than design debate.
Use competitive only when independent alternative implementations are useful; otherwise split by ownership scope.

Keep squads small. Two or three agents per squad is usually the useful limit: implementer, shadow/reviewer, and optionally test/validation owner. More agents belong in separate squads or in the final review team.

Pattern Details

All examples below use real API shapes (WorkflowBuilder / YAML), verified against @relayflows/core's builder.d.ts and schema.d.ts.

YAML fragments vs complete configs: The per-pattern YAML snippets below are fragments that show only the pattern-relevant shape. A runnable YAML file also requires version: "1.0" and name: <id> at the top (runner.ts:2105-2117). See the Complete YAML Example for the full structure.

Topology edges exclude interactive: false agents. resolveTopology (coordinator.ts:218-237) drops non-interactive agents from the message graph — they run as one-shot subprocesses with no relay connection. Topology claims like "hub ↔ spokes" describe the interactive-agent edges; workers marked interactive: false are spawned and collected via stdout, not via relay messages.
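To make the filtering concrete, here is a minimal sketch of how hub-spoke edges might be derived once non-interactive agents are dropped. It is not the code in coordinator.ts: hubSpokeEdges and AgentDef are hypothetical names, and the hub-selection rule (role: lead, else the first agent) is taken from the hub-spoke description in this document.

```typescript
// Illustrative sketch only; the real logic lives in resolveTopology.
interface AgentDef {
  name: string;
  role?: string;
  interactive?: boolean; // treated as true unless explicitly false
}

type Edge = [string, string]; // [from, to]

function hubSpokeEdges(agents: AgentDef[]): Edge[] {
  // Non-interactive agents never join the message graph.
  const live = agents.filter((a) => a.interactive !== false);
  // Hub: the agent with role "lead", otherwise the first interactive agent.
  const hub = live.find((a) => a.role === 'lead') ?? live[0];
  if (!hub) return [];

  const edges: Edge[] = [];
  for (const agent of live) {
    if (agent.name === hub.name) continue;
    // Bidirectional hub <-> spoke link; no spoke-to-spoke edges.
    edges.push([hub.name, agent.name], [agent.name, hub.name]);
  }
  return edges;
}
```

With a lead, one interactive worker, and one interactive: false worker, only the lead and the interactive worker end up connected.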

1. fan-out — Parallel Workers

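await workflow('review')
  .pattern('fan-out')
  .agent('lead', { cli: 'claude', role: 'lead' })
  // One-shot workers: output is collected from stdout, not relay messages.
  .agent('auth-rev', { cli: 'claude', role: 'worker', interactive: false })
  .agent('db-rev', { cli: 'claude', role: 'worker', interactive: false })
  .step('review-auth', { agent: 'auth-rev', task: 'Review auth.ts' })
  .step('review-db', { agent: 'db-rev', task: 'Review db.ts' })
  // Hub aggregates once both independent reviews have finished.
  .step('summarize', {
    agent: 'lead',
    task: 'Summarize findings:\n{{steps.review-auth.output}}\n{{steps.review-db.output}}',
    dependsOn: ['review-auth', 'review-db'],
  })
  .run();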

Workers run independently; hub aggregates. No inter-worker edges.

2. pipeline — Sequential Stages

swarm: { pattern: pipeline }
agents:
  - { name: designer, cli: claude }
  - { name: implementer, cli: codex, interactive: false }
  - { name: tester, cli: codex, interactive: false }
workflows:
  - name: build
    steps:
      - {
          name: design,
          agent: designer,
          task: 'Design the API schema',
          verification: { type: output_contains, value: DONE },
        }
      - {
          name: implement,
          agent: implementer,
          dependsOn: [design],
          task: 'Implement: {{steps.design.output}}',
        }
      - { name: test, agent: tester, dependsOn: [implement], task: 'Write integration tests' }

Each stage receives the previous stage's output via {{steps.<name>.output}}. The workflow halts on step failure unless onError is set to retry or continue.
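As a mental model for the {{...}} placeholders, the sketch below substitutes workflow variables and prior step outputs into a task string. It is an assumption-laden illustration, not the runner's implementation: interpolate is a hypothetical name, and leaving unknown placeholders untouched is a choice made here, not documented runner behavior.

```typescript
// Hypothetical illustration of template substitution; not the runner's code.
function interpolate(
  template: string,
  vars: Record<string, string>,
  stepOutputs: Record<string, string>,
): string {
  return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (match: string, key: string) => {
    // {{steps.<name>.output}} resolves against earlier step outputs.
    const stepName = /^steps\.([\w-]+)\.output$/.exec(key)?.[1];
    if (stepName !== undefined) return stepOutputs[stepName] ?? match;
    // Anything else is treated as a workflow variable such as {{task}}.
    return vars[key] ?? match;
  });
}

const prompt = interpolate(
  'Implement: {{steps.design.output}}',
  {},
  { design: 'REST schema with /users and /sessions' },
);
console.log(prompt);
```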

3. hub-spoke — Persistent Coordinator

await workflow('api-build')
  .pattern('hub-spoke')
  .channel('swarm-api')
  .agent('lead', { cli: 'claude', role: 'lead' })
  .agent('db-worker', { cli: 'claude', role: 'worker' }) // interactive by default — hub DMs it
  .agent('api-worker', { cli: 'claude', role: 'worker' }) // interactive by default — hub DMs it
  .step('models', { agent: 'db-worker', task: 'Build database models' })
  .step('routes', { agent: 'api-worker', task: 'Build route handlers', dependsOn: ['models'] })
  .step('review', { agent: 'lead', task: 'Review everything', dependsOn: ['routes'] })
  .run();

The hub (picked via role: lead, or else the first agent) stays on the channel and direct-messages interactive workers via the flat send_dm MCP tool, often exposed by workflow prompts as mcp__relaycast__send_dm.

Don't set interactive: false on a hub-spoke worker if you want it to receive coordination DMs — resolveTopology strips non-interactive agents from the message graph (coordinator.ts:218-237). Use interactive: false only when the worker is a one-shot subprocess whose stdout you collect via {{steps.X.output}} without any mid-run coordination.

4. consensus — Cooperative Voting

swarm: { pattern: consensus }
agents:
  - { name: perf, cli: claude, role: reviewer }
  - { name: dx, cli: claude, role: reviewer }
  - { name: sec, cli: claude, role: reviewer }
coordination:
  consensusStrategy: majority # declarative marker: majority | unanimous | quorum
  votingThreshold: 0.66
workflows:
  - name: decide
    steps:
      - { name: evaluate-perf, agent: perf, task: 'Evaluate perf of Fastify migration' }
      - { name: evaluate-dx, agent: dx, task: 'Evaluate DX of Fastify migration' }
      - { name: evaluate-sec, agent: sec, task: 'Evaluate security of Fastify migration' }

Full-mesh topology. Caveat: coordination.consensusStrategy and votingThreshold are declared in CoordinationConfig (types.ts:768-772), but the runner has no built-in vote-tallying logic; the fields only influence coordinator auto-selection (coordinator.ts:63-64). To implement voting, add a downstream lead/judge step that depends on the three evaluate steps and aggregates {{steps.evaluate-perf.output}}, {{steps.evaluate-dx.output}}, and {{steps.evaluate-sec.output}}.
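Since tallying is left to you, one option is to have each reviewer end its output with a vote line and count those lines yourself, either in a judge prompt or in code after the run. The sketch below shows the code route. Everything in it is an assumption for illustration: the "VOTE: APPROVE" / "VOTE: REJECT" line convention, the tallyVotes name, and the reading of quorum as "approval share meets votingThreshold".

```typescript
// Hypothetical vote tally; the engine does not provide this.
type ConsensusStrategy = 'majority' | 'unanimous' | 'quorum';

function tallyVotes(
  outputs: string[],
  strategy: ConsensusStrategy,
  votingThreshold = 0.66,
): boolean {
  // Count outputs that contain a line starting with "VOTE: APPROVE".
  const approvals = outputs.filter((o) => /^VOTE:\s*APPROVE\b/m.test(o)).length;
  const share = outputs.length === 0 ? 0 : approvals / outputs.length;

  if (strategy === 'unanimous') return outputs.length > 0 && approvals === outputs.length;
  if (strategy === 'quorum') return share >= votingThreshold; // assumed meaning
  return share > 0.5; // majority
}

const approved = tallyVotes(
  ['Perf looks fine.\nVOTE: APPROVE', 'VOTE: APPROVE', 'Plugin audit needed.\nVOTE: REJECT'],
  'majority',
);
console.log(approved);
```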

5. mesh — Peer Collaboration

await workflow('debug-auth')
  .pattern('mesh')
  .channel('swarm-debug')
  .agent('logs', { cli: 'claude' })
  .agent('code', { cli: 'claude' })
  .agent('repro', { cli: 'claude' })
  .step('logs', { agent: 'logs', task: 'Check server logs' })
  .step('code', { agent: 'code', task: 'Review auth code' })
  .step('repro', { agent: 'repro', task: 'Write repro test' })
  .run();

Every agent ↔ every other agent. Use for collaborative exploration without hierarchy.
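The cost of full mesh is easiest to see by counting links. The sketch below compares the number of bidirectional links in a full mesh with a single-hub layout; the function names are illustrative, and each bidirectional link is counted once.

```typescript
// Link counts for n agents (each bidirectional link counted once).
function meshLinkCount(n: number): number {
  // Every pair of agents is connected: n choose 2.
  return (n * (n - 1)) / 2;
}

function hubSpokeLinkCount(n: number): number {
  // One hub connected to each of the other n - 1 agents.
  return Math.max(n - 1, 0);
}

for (const n of [3, 5, 10]) {
  console.log(`${n} agents: mesh=${meshLinkCount(n)}, hub-spoke=${hubSpokeLinkCount(n)}`);
}
```

Mesh links grow quadratically while hub-spoke links grow linearly, which is why mesh suits small groups.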

6. handoff — Dynamic Routing

swarm: { pattern: handoff }
agents:
  - { name: triage, cli: claude }
  - { name: billing, cli: claude }
  - { name: tech, cli: claude }
workflows:
  - name: support
    steps:
      - { name: triage, agent: triage, task: 'Triage: {{request}}' }
      - { name: billing, agent: billing, dependsOn: [triage], task: 'Handle billing' }
      - { name: tech, agent: tech, dependsOn: [triage], task: 'Handle tech issues' }

Chain passes control forward. Note: The runner doesn't support "route to one branch and skip the others" declaratively — dependsOn steps all run when their dependencies complete, and skipping is only triggered by upstream failure (runner.ts:7057-7088). For true pick-one routing, have the triage step emit a routing token in its output and let each downstream step's prompt check {{steps.triage.output}} and no-op if it doesn't match.
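One way to apply the routing-token idea is to generate each branch's task prompt from a small helper so the no-op instruction is worded consistently. This is a sketch under stated assumptions: routedTask is a hypothetical helper, and the "ROUTE: <branch>" and "SKIPPED:" conventions are invented here for illustration; the engine does not interpret either.

```typescript
// Hypothetical prompt builder for pick-one routing via a token in triage output.
function routedTask(branch: string, instructions: string): string {
  return [
    // The runner substitutes this placeholder with the triage step's output.
    'Triage output:\n{{steps.triage.output}}',
    // The agent itself decides whether it was selected.
    `If the triage output does not contain the line "ROUTE: ${branch}", ` +
      `reply "SKIPPED: not routed to ${branch}" and stop.`,
    `Otherwise: ${instructions}`,
  ].join('\n\n');
}

console.log(routedTask('billing', 'Handle the billing request.'));
```

The triage step's own task would then need to instruct the agent to end its output with exactly one ROUTE line.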

7. cascade — Cost-Aware Fallthrough

await workflow('answer')
  .pattern('cascade')
  .agent('haiku', { cli: 'claude', model: 'claude-haiku-4-5-20251001' })
  .agent('sonnet', { cli: 'claude', model: 'claude-sonnet-4-6' })
  .agent('opus', { cli: 'claude', model: 'claude-opus-4-7' })
  .step('try-haiku', { agent: 'haiku', task: '{{question}}' })
  .step('try-sonnet', {
    agent: 'sonnet',
    task: 'If this is a complete answer, echo it verbatim. Otherwise answer anew:\n{{steps.try-haiku.output}}',
    dependsOn: ['try-haiku'],
  })
  .step('try-opus', {
    agent: 'opus',
    task: 'Final-tier answer, using prior attempts for context:\n{{steps.try-sonnet.output}}',
    dependsOn: ['try-sonnet'],
  })
  .run();

Important: cascade only sets edge topology. The runner has no skip-on-success logic for the cascade pattern — a chain of dependsOn steps all execute in order on success, and failed upstream steps mark their dependents as skipped (step-executor.ts:329-334, runner.ts:7057-7088). So a verification-gated first step won't "fall through" to later steps on failure, and won't skip them on success either. The idiom above delegates the escalation decision to the prompt of each downstream step (read the upstream answer and pass-through or redo). No confidence-score parsing exists in-engine.
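The semantics described above (all steps run on success, dependents are skipped after an upstream failure) can be modeled in a few lines. This is a toy simulation for reasoning about a chain, not the runner's code: simulateChain is a hypothetical name, steps are assumed to be listed in dependency order, and onError: retry / continue is ignored.

```typescript
// Toy model of dependsOn-chain outcomes; not the actual runner logic.
type StepStatus = 'completed' | 'failed' | 'skipped';

interface ChainStep {
  name: string;
  dependsOn?: string[];
}

function simulateChain(
  steps: ChainStep[],
  failing: ReadonlySet<string>, // names of steps that would fail if run
): Record<string, StepStatus> {
  const status: Record<string, StepStatus> = {};
  for (const step of steps) {
    // A step is skipped when any dependency did not complete.
    const blocked = (step.dependsOn ?? []).some((dep) => status[dep] !== 'completed');
    if (blocked) status[step.name] = 'skipped';
    else status[step.name] = failing.has(step.name) ? 'failed' : 'completed';
  }
  return status;
}

const cascade: ChainStep[] = [
  { name: 'try-haiku' },
  { name: 'try-sonnet', dependsOn: ['try-haiku'] },
  { name: 'try-opus', dependsOn: ['try-sonnet'] },
];
console.log(simulateChain(cascade, new Set<string>()));
console.log(simulateChain(cascade, new Set(['try-haiku'])));
```

Neither outcome matches a "cheap tier fails, expensive tier takes over" cascade, which is why the escalation decision has to live in the downstream prompts.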

8. dag — Directed Acyclic Graph

await workflow('fullstack')
  .pattern('dag')
  .maxConcurrency(3)
  .agent('dev', { cli: 'codex', role: 'worker' })
  .step('scaffold', { agent: 'dev', task: 'Create project scaffold' })
  .step('frontend', { agent: 'dev', task: 'Build React UI', dependsOn: ['scaffold'] })
  .step('backend', { agent: 'dev', task: 'Build API', dependsOn: ['scaffold'] })
  .step('integrate', { agent: 'dev', task: 'Wire together', dependsOn: ['frontend', 'backend'] })
  .run();

The runner derives execution waves from dependsOn; independent nodes run in parallel up to swarm.maxConcurrency. When swarm.pattern is omitted (programmatic API only, since YAML validation requires it), dag is auto-selected if any step has dependsOn.
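To see what "execution waves" means, the sketch below groups steps into layers where every step's dependencies sit in an earlier layer. It is a generic topological layering, offered as an approximation of the idea rather than the runner's scheduler: executionWaves is a hypothetical name, and maxConcurrency is not modeled.

```typescript
// Generic wave computation from dependsOn edges (illustrative only).
interface DagStep {
  name: string;
  dependsOn?: string[];
}

function executionWaves(steps: DagStep[]): string[][] {
  const done = new Set<string>();
  let pending = [...steps];
  const waves: string[][] = [];

  while (pending.length > 0) {
    // A step is ready once all of its dependencies have completed.
    const ready = pending.filter((s) => (s.dependsOn ?? []).every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error('Cycle or unknown dependency in dependsOn');
    }
    waves.push(ready.map((s) => s.name));
    for (const s of ready) done.add(s.name);
    pending = pending.filter((s) => !done.has(s.name));
  }
  return waves;
}

console.log(
  executionWaves([
    { name: 'scaffold' },
    { name: 'frontend', dependsOn: ['scaffold'] },
    { name: 'backend', dependsOn: ['scaffold'] },
    { name: 'integrate', dependsOn: ['frontend', 'backend'] },
  ]),
);
```

For the fullstack example this yields three waves, with frontend and backend sharing the middle one.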

9. debate — Adversarial Refinement

Debate currently shares the full-mesh topology with mesh and consensus. Differentiate via roles + task prompts:

swarm: { pattern: debate }
agents:
  - { name: pro, cli: claude, role: debater, task: 'Argue FOR monorepo' }
  - { name: con, cli: claude, role: debater, task: 'Argue FOR polyrepo' }
  - { name: judge, cli: claude, role: judge, task: 'Decide after 3 rounds' }
coordination:
  barriers:
    - { name: debate-done, waitFor: [pro-round-3, con-round-3] }

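Drive rounds and verdicts through each agent's system prompt/task, not a dedicated maxRounds knob; there isn't one at the pattern level. The barrier's waitFor entries (pro-round-3, con-round-3) name workflow steps that this fragment omits.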

10. hierarchical — Multi-Level (structurally hub-spoke today)

await workflow('large-team')
  .pattern('hierarchical')
  .agent('lead', { cli: 'claude', role: 'lead' })
  .agent('fe-coord', { cli: 'claude', role: 'coordinator' })
  .agent('be-coord', { cli: 'claude', role: 'coordinator' })
  .agent('fe-dev', { cli: 'codex', role: 'worker', interactive: false })
  .agent('be-dev', { cli: 'codex', role: 'worker', interactive: false })
  .step('plan', { agent: 'lead', task: 'Coordinate full-stack app' })
  .step('fe-plan', { agent: 'fe-coord', task: 'Manage frontend', dependsOn: ['plan'] })
  .step('be-plan', { agent: 'be-coord', task: 'Manage backend', dependsOn: ['plan'] })
  .step('fe-impl', { agent: 'fe-dev', task: 'Build components', dependsOn: ['fe-plan'] })
  .step('be-impl', { agent: 'be-dev', task: 'Build API', dependsOn: ['be-plan'] })
  .run();

The coordinator/worker distinction is expressed in the step dependsOn graph, not in the topology. Agent edges collapse to a single-level hub-spoke.

Verification & Completion Signals

An agent step can complete in several ways (runner.ts:5353-5395, runner.ts:4527-4538):

Verification pass — when the step declares a verification block and the output satisfies it.
Clean process exit — agent exits 0 with no verification configured.
Evidence-based — channel posts, file changes, or coordination signals trigger completion.
Owner decision — a lead-role agent posts COMPLETE / INCOMPLETE_RETRY / INCOMPLETE_FAIL for the step.

Verification block shape:

verification:
  type: output_contains # or: exit_code | file_exists | custom
  value: DONE # or: PLAN_COMPLETE, IMPLEMENTATION_COMPLETE, REVIEW_COMPLETE

Conventional signals expected by the @relayflows/core runner:

ACK: ... — received a task
DONE: ... — task complete
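For intuition, the two checks above can be sketched as plain string operations on captured output. Both helpers are hypothetical and not part of @relayflows/core; treating output_contains as a case-sensitive substring match, and matching signals only at the start of a line, are assumptions.

```typescript
// Illustrative helpers; the runner's actual matching rules may differ.
interface OutputContainsVerification {
  type: 'output_contains';
  value: string;
}

function passesVerification(v: OutputContainsVerification, output: string): boolean {
  // Assumed: plain, case-sensitive substring match.
  return output.includes(v.value);
}

function extractSignals(output: string): { acks: string[]; dones: string[] } {
  const acks: string[] = [];
  const dones: string[] = [];
  for (const line of output.split(/\r?\n/)) {
    const m = /^\s*(ACK|DONE):\s*(.*)$/.exec(line);
    if (!m) continue;
    const message = (m[2] ?? '').trim();
    if (m[1] === 'ACK') acks.push(message);
    else dones.push(message);
  }
  return { acks, dones };
}

const sample = 'ACK: reviewing auth.ts\n...working...\nDONE: review complete';
console.log(extractSignals(sample));
console.log(passesVerification({ type: 'output_contains', value: 'DONE' }, sample));
```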

The runner captures PTY chunks as step output and also records channel posts + file changes as StepCompletionEvidence. Legacy fallback: a file at .relay/summaries/{stepName}.md is read if PTY output is empty.

Agent Relay MCP - Correct Tool Names

The old category-expanded names are wrong. Current Agent Relay MCP tools are flat names. In a client that decorates MCP tools, the prefix comes from the configured server key; workflow prompts commonly show mcp__relaycast__send_dm, while an agent-relay server key may expose mcp__agent_relay__send_dm.

Purpose	Canonical tool	Common workflow-prefixed form
Send DM to another agent	`send_dm`	`mcp__relaycast__send_dm`
Check inbox	`check_inbox`	`mcp__relaycast__check_inbox`
List agents	`list_agents`	`mcp__relaycast__list_agents`
Post to a channel	`post_message`	`mcp__relaycast__post_message`
Reply in a thread	`reply_to_thread`	`mcp__relaycast__reply_to_thread`
Spawn sub-agent	`add_agent`	`mcp__relaycast__add_agent`
Remove sub-agent	`remove_agent`	`mcp__relaycast__remove_agent`

interactive: false agents run as non-interactive subprocesses with no relay connection. They must not call Relay MCP tools.
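The relationship between the flat tool names and their prefixed forms can be sketched as a string template. This is an illustration of the naming shown in the table, not a client API: decoratedToolName is a hypothetical helper, the mcp__<key>__<tool> shape depends on the MCP client, and the hyphen-to-underscore step is inferred from the agent-relay / mcp__agent_relay__send_dm example above.

```typescript
// Hypothetical helper showing how a decorated tool name is assembled.
const RELAY_TOOLS = [
  'send_dm',
  'check_inbox',
  'list_agents',
  'post_message',
  'reply_to_thread',
  'add_agent',
  'remove_agent',
] as const;

type RelayTool = (typeof RELAY_TOOLS)[number];

function decoratedToolName(serverKey: string, tool: RelayTool): string {
  // Assumed normalization: hyphens in the server key become underscores.
  const key = serverKey.replace(/-/g, '_');
  return `mcp__${key}__${tool}`;
}

for (const tool of RELAY_TOOLS) {
  console.log(decoratedToolName('relaycast', tool));
}
```

Restricting the second argument to RelayTool means a stale name such as 'send' fails at compile time.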

Reflection (Trajectories)

Reflection is not a reflectionThreshold callback. It's configured via the trajectories: block:

trajectories:
  enabled: true
  reflectOnBarriers: true # config flag exists but runner does NOT currently invoke this path
  reflectOnConverge: true # fires at parallel convergence points (runner.ts:2762-2779)
  autoDecisions: true # record retry/skip/fail decisions

What actually runs today: only reflectOnConverge is wired into the runner (runner.ts:2762-2779). shouldReflectOnBarriers is defined in trajectory.ts:486-487 but not called — set the flag if you want forward compatibility, but don't depend on it.

Programmatic equivalent:

workflow('x').trajectories({ enabled: true, reflectOnConverge: true });

For a first-class critic loop, use the reflection pattern (agents with role: critic get wired as reviewers in coordinator.ts:363-378).

Common Mistakes

Mistake	Why It Fails	Fix
Using mesh/debate for everything	Full-mesh blows up message volume past ~5 agents	Use hub-spoke or dag for most tasks
Pipeline for independent work	Sequential bottleneck	Use fan-out or dag
Hub-spoke for 2 agents	Hub is unnecessary overhead	Use pipeline or fan-out
Expecting `consensusStrategy` to tally votes	Runner has no vote-tally logic; field only affects coordinator auto-selection	Aggregate votes in a judge/lead step that reads each voter's `{{steps.<name>.output}}`
Handoff with "routing = skip other branches"	Skipping only fires on upstream failure, not routing decisions	Emit a routing token in triage output; downstream prompts self-no-op if token doesn't match
Cascade expecting skip-on-success	Runner has no cascade skip logic; failed upstream skips downstream	Chain downstream prompts to pass-through or redo based on `{{steps.previous.output}}`
Relying on `reflectOnBarriers`	Config flag exists but runner never calls it	Use `reflectOnConverge` for convergence reflection; use `reflection` pattern for critic loops
`interactive: false` agent calling MCP	Non-interactive subprocess has no relay	Use `interactive: true` (default) or emit output on stdout
Relying on multi-level `hierarchical`	Topology is single-level hub in current impl	Use pattern for naming; model levels via `dependsOn` graph
Writing `mcp__relaycast__send(...)`	Wrong tool name	Use `post_message` / `mcp__relaycast__post_message` or `send_dm` / `mcp__relaycast__send_dm`

Resume & Re-run

// Resume a failed run:
await runWorkflow('feature-dev.yaml', { resume: '<runId>' });

// Skip ahead, re-using cached outputs from an earlier run:
await runWorkflow('feature-dev.yaml', {
  startFrom: 'review',
  previousRunId: '<runId>',
});

Cached step outputs live in .agent-relay/step-outputs/, and run records in .agent-relay/workflow-runs.jsonl. The env vars RESUME_RUN_ID, START_FROM, and PREVIOUS_RUN_ID are auto-detected.
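The env vars line up with the option names used in the calls above. The sketch below spells out one plausible mapping for scripts that want to build the options explicitly; resumeOptionsFromEnv is a hypothetical helper, and giving RESUME_RUN_ID precedence over the other two is an assumption, not documented behavior.

```typescript
// Hypothetical mapping from env vars to runWorkflow-style options.
interface ResumeOptions {
  resume?: string;
  startFrom?: string;
  previousRunId?: string;
}

function resumeOptionsFromEnv(env: Record<string, string | undefined>): ResumeOptions {
  const resume = env['RESUME_RUN_ID'];
  // Assumed precedence: a full resume wins over start-from.
  if (resume) return { resume };

  const options: ResumeOptions = {};
  const startFrom = env['START_FROM'];
  const previousRunId = env['PREVIOUS_RUN_ID'];
  if (startFrom) options.startFrom = startFrom;
  if (previousRunId) options.previousRunId = previousRunId;
  return options;
}

console.log(resumeOptionsFromEnv({ START_FROM: 'review', PREVIOUS_RUN_ID: 'run-123' }));
```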

Complete YAML Example

version: '1.0'
name: feature-dev
description: 'Blueprint-style feature development with quality gates.'
swarm:
  pattern: hub-spoke
  maxConcurrency: 2
  timeoutMs: 3600000
  channel: swarm-feature-dev
  idleNudge: { nudgeAfterMs: 120000, escalateAfterMs: 120000, maxNudges: 1 }
agents:
  - { name: lead, cli: claude, role: lead, permissions: { access: full } }
  - { name: planner, cli: codex, role: planner, interactive: false, permissions: { access: readonly } }
  - { name: developer, cli: codex, role: worker, interactive: false, permissions: { access: readwrite } }
  - { name: reviewer, cli: claude, role: reviewer, permissions: { access: readonly } }
workflows:
  - name: feature-delivery
    onError: retry
    preflight:
      - { command: 'git status --porcelain', failIf: non-empty, description: 'Clean worktree' }
    steps:
      - name: plan
        agent: planner
        task: 'Plan: {{task}}'
        verification: { type: output_contains, value: PLAN_COMPLETE }
      - name: implement
        agent: developer
        dependsOn: [plan]
        task: 'Implement: {{steps.plan.output}}'
        verification: { type: output_contains, value: IMPLEMENTATION_COMPLETE }
      - name: test
        type: deterministic
        dependsOn: [implement]
        command: npm test
      - name: review
        agent: reviewer
        dependsOn: [test]
        task: 'Review implementation'
        verification: { type: output_contains, value: REVIEW_COMPLETE }
coordination:
  barriers:
    - { name: delivery-ready, waitFor: [plan, implement, review], timeoutMs: 900000 }
trajectories:
  enabled: true
  reflectOnBarriers: true # forward-compat only; the runner does not invoke this path today
  reflectOnConverge: true
errorHandling:
  strategy: retry
  maxRetries: 2
  retryDelayMs: 5000

Built-in templates live in @relayflows/core/dist/builtin-templates/ (feature-dev, bug-fix, code-review, competitive, documentation, refactor, review-loop, security-audit).

Source of Truth

Claim	File
Pattern enum (24 patterns)	`@relayflows/core/dist/schema.d.ts` (`SwarmPattern`)
Topology resolution per pattern	`@relayflows/core/dist/coordinator.js`
Interactive-only topology edges	`@relayflows/core/dist/coordinator.js` filters `interactive: false` agents
Pattern auto-selection heuristics	`@relayflows/core/dist/coordinator.js`
`WorkflowBuilder` fluent API	`@relayflows/core/dist/builder.d.ts`
`runWorkflow(yamlPath, options)`	`@relayflows/core/dist/run.d.ts`
YAML validation requires `version` + `name` + `swarm.pattern`	`@relayflows/core/dist/runner.js`
MCP tool names	`packages/cli/src/cli/agent-relay-mcp.ts`, `@relayflows/core/dist/channel-messenger.js`
Completion modes (verification / evidence / owner / process-exit)	`@relayflows/core/dist/runner.js`, `@relayflows/core/dist/step-executor.js`
Trajectory reflection	`@relayflows/core/dist/trajectory.js`, `@relayflows/core/dist/runner.js`