swarm

name: swarm description: Dispatch parallel agents. practices:

microservices
team-topologies
mythical-man-month hexagonal_role: supporting consumes:
implement
validate produces:
.agents/swarm/results/*.json context_rel:
kind: customer-of with: crank skill_api_version: 1 context: window: fork intent: mode: task sections: exclude:
- HISTORY intel_scope: full metadata: tier: orchestration dependencies:
- implement
- validate output_contract: .agents/swarm/results/*.json

Swarm Skill

Spawn isolated agents to execute tasks in parallel. Fresh context per agent (Ralph Wiggum pattern).

Before you swarm — pick the lightest path that fits

Confirm a swarm is even warranted. A swarm pays off only with ≥2 genuinely independent units of working-tree work that benefit from isolation. If it's one deliverable, pure analysis/investigation, or no working-tree edits, do NOT swarm — use the lighter path below. Reaching for the machinery on a small task costs more than the task (real failure 2026-06-15: an ATM-codex swarm pointed at a ~9-idea content task wedged on codex boot; in-session Agent fan-out did it in one pass).

Three paths, lightest first — reach for the lightest that fits:

Path	What it is	Use when
In-session Agent/Task fan-out (lightest)	Spawn 2–3 `Agent` subagents in this session. No persistence, no worktrees, no attach, dies with the session. Read-only-friendly.	One-shot parallel work: independent drafts, fan-out analysis, fresh-eyes review. Default for anything small. See `automation-shape-routing` shape 0.
`/swarm` (middle)	Wave-gated working-tree execution with disjoint file ownership + conflict checks (this skill).	≥2 independent units that edit the working tree and need isolation + wave-validity gating.
ATM (`/using-atm`, heaviest)	Persistent tmux panes + human attach/steer + multi-vendor, running whole `/rpi`/`/evolve` loops.	Long-lived epics needing persistence and live steering — not one-shot tasks. Boot cost (esp. codex) alone can exceed doing it inline.

Loop position

Move 5 (wave execution) of the operating loop, specifically the parallel-fork primitive /crank invokes. Refuses to spawn parallel agents on a wave that has not cleared the wave-validity check in the slice validation plan: write scopes must be disjoint, no shared migration/contract/CLI surface, integration order declared when it matters, one owner per slice, discard path per slice. Parallelism is explicit ownership, not swarm chaos. Default to sequential when the wave-validity rows are not all green.

Coupled-chain rule (DERIVED-surface collision). Two slices that both regenerate a shared derived surface — cli-command-surface, registry.json, context-map, or the codex manifest — collide even if their source files are disjoint. Such a coupled chain (schema→write-model→export→gate→tests) MUST run sequential: each link branches off the freshly-MERGED prior link's SHA, never a pre-fan snapshot. Reserve parallel waves for provably-disjoint surfaces, cap 4–6.

Integration modes:

Direct - Create TaskList tasks, invoke /swarm
Via Crank - /crank creates tasks from beads, invokes /swarm for each wave

Requires multi-agent runtime. Swarm needs a runtime that can spawn parallel subagents. If unavailable, work must be done sequentially in the current session.

Architecture (Mayor-First)

Mayor (this session)
    |
    +-> Plan: TaskCreate with dependencies
    |
    +-> Identify wave: tasks with no blockers
    |
    +-> Select spawn backend (gc if available; runtime-native: Claude teams in Claude runtime, Codex sub-agents in Codex runtime; fallback tasks if unavailable)
    |
    +-> Assign: TaskUpdate(taskId, owner="worker-<id>", status="in_progress")
    |
    +-> Spawn workers via selected backend
    |       Workers receive pre-assigned task, execute atomically
    |
    +-> Wait for completion (wait() | SendMessage | TaskOutput)
    |
    +-> Validate: Review changes when complete
    |
    +-> Cleanup backend resources (close_agent | TeamDelete | none)
    |
    +-> Repeat: New team + new plan if more work needed

Execution

Read references/execution-steps.md when you need the full procedural detail (Steps 0–6): backend detection, gc dispatch, task typing + file manifests, context briefing, manifest auto-population, advisory bead clustering, wave identification, pre-spawn conflict check, test-file naming validation, multi-wave base-SHA refresh, and worker dispatch.

Every TaskCreate must include metadata.issue_type plus a metadata.files array. Do not spawn workers with overlapping file manifests into the same shared-worktree wave.

Example Flow

Mayor: "Let's build a user auth system"

1. /plan -> Creates tasks:
   #1 [pending] Create User model
   #2 [pending] Add password hashing (blockedBy: #1)
   #3 [pending] Create login endpoint (blockedBy: #1)
   #4 [pending] Add JWT tokens (blockedBy: #3)
   #5 [pending] Write tests (blockedBy: #2, #3, #4)

2. /swarm -> Spawns agent for #1 (only unblocked task)

3. Agent #1 completes -> #1 now completed
   -> #2 and #3 become unblocked

4. /swarm -> Spawns agents for #2 and #3 in parallel

5. Continue until #5 completes

6. /validate -> Validate everything

Scope-Escape Protocol

When a worker discovers work outside their assigned scope, they MUST NOT modify files outside their file manifest. Instead, append to .agents/swarm/scope-escapes.jsonl:

{"worker": "<worker-id>", "finding": "<description>", "suggested_files": ["path/to/file"], "timestamp": "<ISO8601>"}

For richer scope-escape narration (status classification, concrete next step, evidence), see references/scope-escape-template.md. Use the template when a single-line JSONL entry is insufficient for the operator to act on.

The lead reviews scope escapes after each wave and creates follow-up tasks as needed.

Key Points

Runtime-native local mode - Auto-selects the native backend for the current runtime (gc pool, Claude teams, or Codex sub-agents)
Universal orchestration contract - Same swarm behavior across Claude and Codex sessions
Pre-assigned tasks - Mayor assigns tasks before spawning; workers never race-claim
Fresh worker contexts - New sub-agents/teammates per wave preserve Ralph isolation
Wave execution - Only unblocked tasks spawn
Mayor orchestrates - You control the flow, workers write results to disk
Thin results - Workers write .agents/swarm/results/<id>.json, orchestrator reads files (NOT Task returns or SendMessage content)
Retry via message/input - Use send_input (Codex) or SendMessage (Claude) for coordination only
Atomic execution - Each worker works until task done
Graceful degradation - If multi-agent unavailable, work executes sequentially in current session

Worker report contract + lane authority (cp-hhd7, cards 4 + 16)

Worker FINAL REPORT — what the orchestrator requires

Workers write raw evidence; the orchestrator does not trust prose summaries. Each .agents/swarm/results/<id>.json or the worker's final message SHOULD include:

files_changed: [list of exact paths]
commit_sha: <git rev-parse HEAD>   # or "no commit" + reason
test_tail: <verbatim last N lines of test output>
conflicts_surfaced: [list, or "none"]

A result missing commit_sha or test_tail is treated as unverified until the orchestrator independently confirms persistence and test passage. The audit cost of trusting summaries exceeds the cost of requiring the fields.

Lane authority (POLICY, card 4 — applies when running a two-lane swarm)

In a multi-lane deployment (this is a contextual policy, not a universal swarm rule), a decision inside one lane's scope is decided by that lane with the other lane's view as input — not a vote, not an escalation. Escalate to the human operator only for:

genuine out-of-both-lanes decision forks
gate violations
a loop that cannot self-heal

Escalating an in-lane sequencing call makes the operator a bottleneck. The lane authority rule is enforced by convention in the control-plane context; the mechanism (swarm) is general.

Workflow Integration

This ties into the full workflow:

/research -> Understand the problem
/plan -> Decompose into beads issues
/crank -> Autonomous epic loop
    +-- /swarm -> Execute each wave in parallel
/validate -> Validate results
/post-mortem -> Extract learnings

Direct use (no beads):

TaskCreate -> Define tasks
/swarm -> Execute in parallel

The knowledge flywheel captures learnings from each agent.

Task Management Commands

# List all tasks
TaskList()

# Mark task complete after notification
TaskUpdate(taskId="1", status="completed")

# Add dependency between tasks
TaskUpdate(taskId="2", addBlockedBy=["1"])

Parameters

Parameter	Description	Default
`--max-workers=N`	Max concurrent workers	5
`--from-wave <json-file>`	Load wave from OL hero hunt output (see OL Wave Integration)	-
`--per-task-commits`	Commit per task instead of per wave (for attribution/audit)	Off (per-wave)

When to Use Swarm

Scenario	Use
Single one-shot deliverable / no working-tree edits / read-only investigation	do NOT swarm → do it inline, or fan out 2–3 in-session Agent subagents (see "Before you swarm" above + `automation-shape-routing` shape 0)
Multiple independent tasks	`/swarm` (parallel)
Sequential dependencies	`/swarm` with blockedBy
Mix of both	`/swarm` spawns waves, each wave parallel

Why This Works: Ralph Wiggum Pattern

Follows the Ralph Wiggum Pattern: fresh context per execution unit.

Wave-scoped worker set = spawn workers -> execute -> cleanup -> repeat (fresh context each wave)
Mayor IS the loop - Orchestration layer, manages state across waves
Workers are atomic - One task, one spawn, one result
TaskList as memory - State persists in task status, not agent context
Filesystem for EVERYTHING - Code artifacts AND result status written to disk, not passed through context
Backend messaging for signals only - Short coordination signals (under 100 tokens), never work details

Ralph alignment source: ../shared/references/ralph-loop-contract.md.

Integration with Crank

When /crank invokes /swarm: Crank bridges beads to TaskList, swarm executes with fresh-context agents, crank syncs results back.

You Want	Use	Why
Fresh-context parallel execution	`/swarm`	Each spawned agent is a clean slate
Autonomous epic loop	`/crank`	Loops waves via swarm until epic closes
Just swarm, no beads	`/swarm` directly	TaskList only, skip beads
RPI progress gates	`/flywheel`	Tracks progress; does not execute work

Codex-Specific Coordination (folded from codex-team)

When the selected backend is Codex (Codex CLI on PATH or Codex sub-agents), swarm follows the same wave model, but Codex agents are headless — they cannot negotiate locks or wait turns. The orchestrator IS the lock manager: all conflict prevention happens before spawning, via file-target analysis.

Backend selection (Codex path), in order:

spawn_agent available → Codex experimental sub-agents (preferred)
Codex CLI available → Codex CLI via Bash (codex exec ...)
skill tool read-only (OpenCode) → OpenCode subagents (task(subagent_type="general", ...))
None → fall back to the runtime-native swarm backend / sequential

Decision-time warning (codex boot-wedge). A Codex-backed swarm adds a boot-failure surface — codex can fail to boot at all (the actual 2026-06-15 failure: a codex swarm wedged before doing any work). For a small task the boot cost alone can exceed doing it inline, so first confirm the swarm is warranted (see "Before you swarm" at the top), then run the pre-flight below before committing to a Codex swarm.

Pre-flight (CLI backend only): verify which codex, then test the configured default model with codex exec --full-auto -C "$(pwd)" "echo ok". If either fails, fall back to another backend.

Canonical command + flag order:

codex exec --full-auto -C "$(pwd)" -o <output-file> "<prompt>"

Flag order: --full-auto → -C → -o → prompt (insert -m "<model>" before -C only when intentionally pinning a model; otherwise the user's default is used). Valid flags: --full-auto, -m, -C, -o, --json, --output-schema, --add-dir, -s. Do NOT use -q / --quiet (don't exist).

Cross-project tasks: --add-dir /path/to/other/repo (repeatable) grants access beyond -C.
Progress monitoring: add --json to stream JSONL events (turn.started / turn.completed with token usage). No events for 60s → agent likely stuck.
Sandbox levels (-s): read-only for judges/reviewers, workspace-write (default with --full-auto), danger-full-access only in externally-sandboxed environments. Prefer -s read-only for review/analysis tasks.

File-target strategy (apply before spawning Codex agents):

File Overlap	Strategy	Action
All tasks touch same file	Merge	Combine into 1 agent with all fixes
Some tasks share files	Multi-wave	Shared-file tasks go sequential across waves
No overlap	Parallel	Spawn all agents at once

For multi-wave Codex runs, the lead reads prior-wave results, then injects a summary of what changed (not raw diffs) into the next wave's prompts. Limits: ≤6 agents/wave, 120s default timeout, ≤3 waves (reconsider decomposition beyond that). Output dir: .agents/codex-team/ (or the standard .agents/swarm/results/). Concrete tool calls: references/backend-codex-subagents.md.

Codex troubleshooting: codex not found → npm i -g @openai/codex or fall back; default model unavailable → verify the echo ok pre-flight or pin with -m; empty/missing output → ensure the -o directory exists and is writable.

OL Wave Integration

Read references/ol-wave-integration.md when you invoke /swarm --from-wave <json-file> — covers pre-flight ol CLI check, input JSON format, task creation from wave entries, completion backflow via ol hero ratchet, and example flow.

References

Executable acceptance: references/swarm.feature — wave-validity gate, fresh-context workers, conflict-free ownership, results+cleanup (soc-qk4b)
Local Mode Details: skills/swarm/references/local-mode.md
Validation Contract: skills/swarm/references/validation-contract.md

Examples

Building a User Auth System

User says: /swarm

What happens:

Agent identifies unblocked tasks from TaskList (e.g., "Create User model")
Agent selects spawn backend using runtime-native priority (Claude session -> Claude teams; Codex session -> Codex sub-agents)
Agent spawns worker for task #1, assigns ownership via TaskUpdate
Worker completes, team lead validates changes
Agent identifies next wave (tasks #2 and #3 now unblocked)
Agent spawns two workers in parallel for Wave 2

Result: Multi-wave execution with fresh-context workers per wave, zero race conditions.

Direct Swarm Without Beads

User says: Create three tasks for API refactor, then /swarm

What happens:

User creates TaskList tasks with TaskCreate
Agent calls /swarm without beads integration
Agent identifies parallel tasks (no dependencies)
Agent spawns all three workers simultaneously
Workers execute atomically, report to team lead via SendMessage or task completion
Team lead validates all changes, commits once per wave

Result: Parallel execution of independent tasks using TaskList only.

Worktree Isolation (Multi-Epic Dispatch)

Read references/shared-checkout-discipline.md first when the target checkout (~/dev/<repo>) is shared with peer agents — it documents when worktrees are mandatory (vs. optional) and the three failure modes (branch-deletion data loss, swarm attribution confounded, destructive-recovery temptation) that motivate the discipline.

Read references/worktree-isolation.md when you need to dispatch workers across multiple epics or run waves with overlapping files — covers isolation semantics per backend, effort levels, post-spawn verification, manual worktree creation/routing/merge-back, the Merge Arbiter Protocol, cleanup, and the --worktrees / --no-worktrees parameters.

Worktree reaping (teardown). After a worker's PR is confirmed MERGED (gh pr view --json state = MERGED), reap its tree: git worktree remove <path> --force then git worktree prune. Leave unmerged-PR worktrees intact. Target zero orphaned worktrees — bound the live count to in-flight PRs. (The committed disposition gate scans the tracked file set, not live on-disk worktrees, so this teardown is the operational backstop.)

Troubleshooting

Read references/troubleshooting.md for full diagnostics.

Problem	See
Worktree isolation did not engage	references/troubleshooting.md
Workers produce file conflicts	references/troubleshooting.md
Team creation fails	references/troubleshooting.md
Codex agents unavailable	references/troubleshooting.md
Workers timeout or hang	references/troubleshooting.md
gc backend detected but workers unresponsive	references/troubleshooting.md
Tasks assigned but workers never spawn	references/troubleshooting.md

Related skills

/using-atm — out-of-session ATM substrate when a swarm needs persistent panes and human attach/steer.

Swarm Skill

Before you swarm — pick the lightest path that fits

Loop position

Architecture (Mayor-First)

Execution

Example Flow

Scope-Escape Protocol

Key Points

Worker report contract + lane authority (cp-hhd7, cards 4 + 16)

Worker FINAL REPORT — what the orchestrator requires

Lane authority (POLICY, card 4 — applies when running a two-lane swarm)

Workflow Integration

Task Management Commands

Parameters

When to Use Swarm

Why This Works: Ralph Wiggum Pattern

Integration with Crank

Codex-Specific Coordination (folded from codex-team)

OL Wave Integration

References

Examples

Building a User Auth System

Direct Swarm Without Beads

Worktree Isolation (Multi-Epic Dispatch)

Troubleshooting

Related skills

Reference Documents