name: Coordination Patterns
description: This skill should be used when the user asks about "agent coordination", "MAS architecture", "blackboard pattern", "orchestrator pattern", "how agents communicate", "multi-agent workflow", "event-driven agents", "context engineering", "control flow", "stateless reducers", or needs to design how multiple agents work together. Covers patterns from Planner-Executor-Verifier to event-driven architectures, plus 12 Factor Agents principles.
version: 1.1.0

Coordination Patterns

Pattern Selection Guide

Pattern	When to Use	Complexity
Planner→Executor→Verifier	Default starting point	Low
Blackboard	Multiple agents, shared state	Medium
Orchestrator-Worker	Dynamic task assignment	Medium
Hierarchical	Deep delegation chains	High
Event-Driven	High reliability needs	High
Market-Based	Dynamic load balancing	High

Pattern 1: Planner → Executor → Verifier (Minimum Viable MAS)

The baseline pattern and the default starting point for most systems.

User Request
    │
    ▼
┌─────────┐
│ Planner │ ──► Task Graph
└─────────┘
    │
    ▼
┌──────────┐
│ Executor │ ──► Results
└──────────┘
    │
    ▼
┌──────────┐
│ Verifier │ ──► PASS/FAIL
└──────────┘
    │
    ├─► PASS ──► Return Result
    │
    └─► FAIL ──► Re-plan with Feedback

Implementation

## Coordination Protocol

1. Planner receives requirements, outputs task graph
2. Executor processes tasks sequentially or in parallel
3. Verifier checks all outputs against requirements
4. On FAIL: Planner receives feedback, re-plans
5. Max 3 iterations before escalation
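
The five protocol steps can be sketched as a single loop. This is a minimal illustration, not a reference implementation: `run_pev`, `EscalationRequired`, and the `plan` / `execute` / `verify` callables are hypothetical names standing in for your own agents, and execution is sequential for simplicity.

```python
from typing import Callable

MAX_ITERATIONS = 3  # matches step 5 of the protocol


class EscalationRequired(Exception):
    """Raised when verification still fails after MAX_ITERATIONS attempts."""


def run_pev(
    requirements: str,
    plan: Callable[[str, list[str]], list[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, list[str]], tuple[bool, str]],
) -> list[str]:
    """Run plan -> execute -> verify, re-planning with feedback on FAIL."""
    feedback: list[str] = []
    for _ in range(MAX_ITERATIONS):
        tasks = plan(requirements, feedback)            # step 1 (step 4 on retries)
        results = [execute(task) for task in tasks]     # step 2, sequential here
        passed, reason = verify(requirements, results)  # step 3
        if passed:
            return results
        feedback.append(reason)                         # carried into the next plan
    raise EscalationRequired(
        f"verification failed {MAX_ITERATIONS} times: {feedback}"
    )
```

The caller catches `EscalationRequired` and hands the accumulated feedback to a human or a higher-authority agent.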

When to Use

Starting a new MAS project
Tasks have clear decomposition
Verification is important but not adversarial

Limitations

Sequential bottleneck when tasks are independent
A single verifier may miss issues
No intermediate checkpoints: errors surface only at final verification

Pattern 2: Blackboard Architecture

All agents read from and write to shared state.

┌─────────────────────────────────────┐
│           BLACKBOARD                │
│  ┌─────────┐ ┌─────────┐ ┌───────┐  │
│  │  Plan   │ │ Results │ │ State │  │
│  └─────────┘ └─────────┘ └───────┘  │
└─────────────────────────────────────┘
       ▲            ▲           ▲
       │            │           │
   ┌───┴───┐   ┌────┴────┐  ┌───┴────┐
   │Planner│   │Executors│  │Verifier│
   └───────┘   └─────────┘  └────────┘

Implementation

{
  "blackboard": {
    "sections": {
      "plan": {
        "owner": "Planner",
        "writers": ["Planner"],
        "readers": ["Executor", "Verifier"]
      },
      "results": {
        "owner": "Executor",
        "writers": ["Executor"],
        "readers": ["Verifier", "Orchestrator"]
      },
      "verdicts": {
        "owner": "Verifier",
        "writers": ["Verifier"],
        "readers": ["Planner", "Orchestrator"]
      }
    }
  }
}

Key Rules

No direct overwrites: Agents cannot modify others' sections
Versioned updates: Every write increments version
Read permissions: Explicit per section
Conflict-free: Writers have exclusive sections
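
One way to enforce these rules in code is sketched below. The `Section` and `Blackboard` classes are hypothetical, and the sketch assumes that a section's writers may also read it, which the rules above do not state either way.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Section:
    writers: set[str]
    readers: set[str]
    version: int = 0
    value: Any = None
    # (version, agent, value) for every write: the audit trail
    history: list[tuple[int, str, Any]] = field(default_factory=list)


class Blackboard:
    def __init__(self, sections: dict[str, Section]) -> None:
        self._sections = sections

    def write(self, agent: str, section: str, value: Any) -> int:
        """Write to a section the agent owns; returns the new version."""
        sec = self._sections[section]
        if agent not in sec.writers:
            raise PermissionError(f"{agent} may not write to '{section}'")
        sec.version += 1  # every write increments the version
        sec.value = value
        sec.history.append((sec.version, agent, value))
        return sec.version

    def read(self, agent: str, section: str) -> tuple[int, Any]:
        """Return (version, value) if the agent has read access."""
        sec = self._sections[section]
        if agent not in sec.readers and agent not in sec.writers:
            raise PermissionError(f"{agent} may not read '{section}'")
        return sec.version, sec.value


board = Blackboard({
    "plan": Section(writers={"Planner"}, readers={"Executor", "Verifier"}),
    "results": Section(writers={"Executor"}, readers={"Verifier", "Orchestrator"}),
})
board.write("Planner", "plan", ["task-1", "task-2"])
```

Because each section has exactly one writing role, no locking or merge logic is needed in this sketch; concurrent writers within a role would need it.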

When to Use

Multiple agents need shared context
Want to avoid direct agent-to-agent communication
Need audit trail of all state changes

Pattern 3: Orchestrator-Worker

Central orchestrator assigns tasks to worker agents.

          ┌──────────────┐
          │ Orchestrator │
          └──────────────┘
           /      |      \
          ▼       ▼       ▼
    ┌────────┐ ┌────────┐ ┌────────┐
    │Worker 1│ │Worker 2│ │Worker 3│
    └────────┘ └────────┘ └────────┘

Implementation

## Orchestrator Responsibilities
- Receive user request
- Decompose into tasks
- Assign to available workers
- Collect results
- Handle failures and retries
- Return final result

## Worker Protocol
- Poll for tasks or receive push
- Execute assigned task
- Report result to orchestrator
- No direct worker-to-worker communication
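
A minimal sketch of these responsibilities, assuming push-style assignment over a thread pool. `orchestrate`, `decompose`, and `worker` are hypothetical names; the retry policy (immediate retry, fixed count) is an illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def orchestrate(
    request: str,
    decompose: Callable[[str], list[str]],
    worker: Callable[[str], str],
    max_workers: int = 3,
    max_retries: int = 2,
) -> list[str]:
    """Decompose a request, fan tasks out to workers, collect results in order."""
    tasks = decompose(request)

    def run_with_retry(task: str) -> str:
        last_error = None
        for _ in range(max_retries + 1):
            try:
                return worker(task)  # worker reports back only to the orchestrator
            except Exception as exc:
                last_error = exc     # retry on any failure
        raise RuntimeError(
            f"task {task!r} failed after {max_retries + 1} attempts"
        ) from last_error

    # Workers are interchangeable: any pool thread may take any task.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_with_retry, tasks))
```

`pool.map` returns results in task order, so the orchestrator can assemble the final result without tracking which worker handled which task.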

When to Use

Dynamic workload distribution
Workers are interchangeable
Need central control point

Pattern 4: Hierarchical Agent

Layered delegation with parent-child relationships.

              ┌───────────┐
              │  Manager  │
              └───────────┘
               /         \
              ▼           ▼
       ┌──────────┐ ┌──────────┐
       │ TeamLead │ │ TeamLead │
       └──────────┘ └──────────┘
        /       \       |
       ▼         ▼      ▼
   ┌──────┐ ┌──────┐ ┌──────┐
   │Worker│ │Worker│ │Worker│
   └──────┘ └──────┘ └──────┘

Implementation

## Delegation Rules
- Parents decompose tasks for children
- Children report completion to parent only
- No cross-branch communication
- Escalation goes up the hierarchy

## Span of Control
- Optimal: 3-5 direct reports per agent
- Max: 7 (coordination overhead grows with each additional report)
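
The delegation rules and the span-of-control limit can be sketched as a small tree. The `Agent` class and the `task.N` sub-task naming are hypothetical; a real parent would decompose the task with an LLM call rather than by numbering.

```python
from dataclasses import dataclass, field

MAX_SPAN = 7  # hard limit on direct reports, from the rule above


@dataclass
class Agent:
    name: str
    children: list["Agent"] = field(default_factory=list)

    def add_child(self, child: "Agent") -> None:
        if len(self.children) >= MAX_SPAN:
            raise ValueError(f"{self.name} already has {MAX_SPAN} direct reports")
        self.children.append(child)

    def handle(self, task: str) -> list[str]:
        """Leaves execute; parents split the task and collect child reports."""
        if not self.children:
            return [f"{self.name} did {task}"]
        reports: list[str] = []
        for i, child in enumerate(self.children, start=1):
            # Each child sees only its own sub-task and reports to its parent only.
            reports.extend(child.handle(f"{task}.{i}"))
        return reports


manager = Agent("Manager")
lead_a, lead_b = Agent("LeadA"), Agent("LeadB")
manager.add_child(lead_a)
manager.add_child(lead_b)
lead_a.add_child(Agent("W1"))
lead_a.add_child(Agent("W2"))
lead_b.add_child(Agent("W3"))
reports = manager.handle("T")
```

There is no path from one branch to another in this structure, so the "no cross-branch communication" rule holds by construction.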

When to Use

Complex domains with natural hierarchy
Different abstraction levels needed
Clear chains of responsibility

Pattern 5: Event-Driven Architecture

Agents react to events rather than direct calls.

┌─────────────────────────────────────┐
│           EVENT BUS                 │
└─────────────────────────────────────┘
     ▲         ▲         ▲         ▲
     │ publish │ publish │ publish │
     │         │         │         │
┌────┴──┐ ┌────┴──┐ ┌────┴──┐ ┌────┴──┐
│Agent A│ │Agent B│ │Agent C│ │Agent D│
└───────┘ └───────┘ └───────┘ └───────┘
     │         │         │         │
     ▼ sub     ▼ sub     ▼ sub     ▼ sub
┌─────────────────────────────────────┐
│           EVENT BUS                 │
└─────────────────────────────────────┘

Event Types

{
  "event_types": {
    "TaskCreated": { "triggers": ["Executor"] },
    "TaskCompleted": { "triggers": ["Verifier", "Logger"] },
    "VerificationFailed": { "triggers": ["Planner"] },
    "SystemError": { "triggers": ["AlertHandler"] }
  }
}
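
A minimal in-process sketch of an event bus that routes these event types. `EventBus` is a hypothetical class; dispatch here is synchronous for brevity, whereas a production bus would typically sit on a durable queue or log.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[dict] = []  # complete, ordered event history

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, data: dict) -> None:
        """Append the event to the log, then notify every subscriber of its type."""
        event = {"seq": len(self.log) + 1, "type": event_type, "data": data}
        self.log.append(event)
        for handler in list(self._subscribers.get(event_type, [])):
            handler(event)


bus = EventBus()
seen: list[tuple[str, int]] = []
# Mirrors "TaskCompleted": { "triggers": ["Verifier", "Logger"] }
bus.subscribe("TaskCompleted", lambda e: seen.append(("Verifier", e["seq"])))
bus.subscribe("TaskCompleted", lambda e: seen.append(("Logger", e["seq"])))
bus.publish("TaskCreated", {"task": "t1"})
bus.publish("TaskCompleted", {"task": "t1"})
```

Publishers never reference subscribers directly, and `bus.log` keeps every event even when nobody is subscribed to its type.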

Benefits

Reliable coordination: Replayable events enable recovery
Loose coupling: Agents don't know about each other
Scalable: Easy to add new agents
Auditable: Complete event history

When to Use

High reliability requirements
Need fault tolerance
Complex event dependencies
Async processing beneficial

Pattern 6: Escalation Over Consensus

When agents disagree, escalate—don't vote.

┌─────────┐ ┌─────────┐
│ Agent A │ │ Agent B │
└─────────┘ └─────────┘
     │           │
     ▼           ▼
  Output A    Output B
     │           │
     └─────┬─────┘
           │ (conflict?)
           ▼
     ┌───────────┐
     │ Escalator │ ──► Final Decision
     └───────────┘

Why Not Vote?

Averaging dilutes correctness
Majority can be wrong
Loses minority insight

Escalation Protocol

1. Detect conflict (outputs differ significantly)
2. Preserve both outputs with reasoning
3. Escalate to higher-authority agent
4. Authority decides based on evidence, not popularity
5. Log decision rationale for learning
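
The protocol can be sketched as follows. `Proposal`, `resolve`, and the `authority` callable are hypothetical names, and exact string inequality is used as a stand-in for "differ significantly"; a real system would need a domain-specific comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    agent: str
    output: str
    reasoning: str


Authority = Callable[[Proposal, Proposal], tuple[Proposal, str]]


def resolve(a: Proposal, b: Proposal, authority: Authority, log: list[dict]) -> str:
    """Return the agreed output, or escalate to the authority on conflict."""
    if a.output == b.output:             # step 1: no conflict detected
        return a.output
    # Steps 2-4: both proposals are passed whole, reasoning included.
    winner, rationale = authority(a, b)
    log.append({                         # step 5: keep the rationale
        "candidates": [a, b],
        "chosen": winner.agent,
        "rationale": rationale,
    })
    return winner.output
```

Note that the losing proposal stays in the log with its reasoning, so minority insight is preserved even when it is not chosen.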

Anti-Patterns to Avoid

Anti-Pattern 1: Synchronous Blocking Chains

Bad: Agent A calls Agent B calls Agent C, each waiting.

Impact: Latency accumulates, one failure blocks all.

Fix: Use async message passing or events.
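
As a rough illustration of the fix, the sketch below connects three agents with `asyncio.Queue` objects instead of nested calls. The `agent` and `run_pipeline` functions are hypothetical, and each agent's "work" is just string tagging.

```python
import asyncio
from typing import Optional


async def agent(name: str, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Process messages until a None sentinel arrives, then pass it on."""
    while True:
        msg: Optional[str] = await inbox.get()
        if msg is None:
            await outbox.put(None)  # propagate shutdown downstream
            return
        await outbox.put(f"{msg}->{name}")


async def run_pipeline(messages: list[str]) -> list[str]:
    q_a, q_b, q_c, q_out = (asyncio.Queue() for _ in range(4))
    workers = [
        asyncio.create_task(agent("A", q_a, q_b)),
        asyncio.create_task(agent("B", q_b, q_c)),
        asyncio.create_task(agent("C", q_c, q_out)),
    ]
    for m in messages:
        await q_a.put(m)
    await q_a.put(None)  # signal end of input
    await asyncio.gather(*workers)

    results: list[str] = []
    while not q_out.empty():
        item = q_out.get_nowait()
        if item is not None:
            results.append(item)
    return results


outputs = asyncio.run(run_pipeline(["m1", "m2"]))
```

Agent A can start on the second message as soon as it has handed the first to B, rather than waiting for C to finish.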

Anti-Pattern 2: Implicit State Sharing

Bad: Agents assume shared context without explicit state.

Impact: Race conditions, state corruption.

Fix: Use blackboard with explicit read/write permissions.

Anti-Pattern 3: Perfect Harmony

Bad: System designed for agents to always agree.

Impact: Groupthink, missed errors.

Fix: Add controlled friction (critics, independent verification).

12 Factor Agents: Control Flow and State

The 12 Factor Agents framework provides engineering principles for coordination.

Factor 3: Own Your Context Building

Principle: Everything that makes agents good is context engineering. Understand what happens at the token level.

Context building components:

System prompt - Agent identity and instructions
RAG results - Retrieved relevant information
Memory - Episodic and semantic recall
Agentic history - Previous steps in this workflow
Structured output instructions - Format requirements

Explicit Context Building Pattern:

def build_context(agent_id: str, task: dict) -> list:
    """Explicit context assembly - no magic."""
    context = []

    # 1. System prompt (Factor 2 - own your prompts)
    context.append({"role": "system", "content": AGENT_PROMPTS[agent_id]})

    # 2. RAG - retrieve relevant documents
    relevant_docs = retrieve(task["query"], top_k=3)
    context.append({"role": "system", "content": format_docs(relevant_docs)})

    # 3. Memory - recall from past
    memories = recall(agent_id, task["context"])
    context.append({"role": "system", "content": format_memories(memories)})

    # 4. Agentic history - what happened so far
    history = get_workflow_history(task["workflow_id"])
    context.append({"role": "system", "content": format_history(history)})

    # 5. Current task
    context.append({"role": "user", "content": task["input"]})

    return context

Context Budget:

Component	Token Budget	Purpose
System prompt	500-1000	Agent identity
RAG results	1000-2000	Relevant knowledge
Memory	500-1000	Past experiences
History	500-1000	Workflow context
Task input	Variable	Current request

Key insight: If you don't understand what happens at the token level, you miss optimization opportunities.
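
A budget like the one in the table can be enforced in code before the context is assembled. The sketch below is an assumption-heavy illustration: `fit_to_budget` is a hypothetical helper, the limits are the upper bounds from the table, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
# Upper bounds from the budget table; task input is deliberately unbounded.
BUDGETS = {
    "system_prompt": 1000,
    "rag_results": 2000,
    "memory": 1000,
    "history": 1000,
}


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(
    components: dict[str, str], budgets: dict[str, int] = BUDGETS
) -> dict[str, str]:
    """Truncate each component to its budget; leave unbudgeted ones untouched."""
    fitted: dict[str, str] = {}
    for name, text in components.items():
        limit = budgets.get(name)
        if limit is None or count_tokens(text) <= limit:
            fitted[name] = text
        else:
            # Keeps the first `limit` words; for history you may prefer the last.
            fitted[name] = " ".join(text.split()[:limit])
    return fitted
```

Swapping `count_tokens` for the tokenizer of the model you actually call makes the budget exact rather than approximate.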

Factor 5/6: Unified Execution and Business State

Principle: Enable Launch/Pause/Resume with simple APIs. Unify what's happening (execution) with what's happened (business).

Unified State Schema:

{
  "workflow_id": "uuid",
  "status": "running|paused|completed|failed",

  "execution_state": {
    "current_step": "step_name",
    "next_step": "step_name|null",
    "waiting_for": "human_input|external_api|null",
    "retry_config": {
      "attempts": 2,
      "max_attempts": 3,
      "backoff_ms": 1000
    }
  },

  "business_state": {
    "messages": [],
    "tool_calls": [],
    "tool_results": [],
    "decisions_made": [],
    "artifacts_produced": []
  },

  "timestamps": {
    "created": "ISO8601",
    "last_updated": "ISO8601",
    "paused_at": "ISO8601|null",
    "resumed_at": "ISO8601|null"
  }
}
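
A sketch of how a fresh state matching this schema might be built. `create_initial_state` is referenced by the controller code in this section but not defined in it, so this body is an assumption; the `"start"` step name and the retry defaults are placeholders.

```python
from datetime import datetime, timezone


def create_initial_state(workflow_id: str, initial_input: dict) -> dict:
    """Build a new workflow state with execution and business state side by side."""
    created = datetime.now(timezone.utc).isoformat()
    return {
        "workflow_id": workflow_id,
        "status": "running",
        "execution_state": {
            "current_step": "start",  # placeholder step name
            "next_step": None,
            "waiting_for": None,
            "retry_config": {"attempts": 0, "max_attempts": 3, "backoff_ms": 1000},
        },
        "business_state": {
            "messages": [initial_input],  # the request is the first message
            "tool_calls": [],
            "tool_results": [],
            "decisions_made": [],
            "artifacts_produced": [],
        },
        "timestamps": {
            "created": created,
            "last_updated": created,
            "paused_at": None,
            "resumed_at": None,
        },
    }
```

Everything in the returned dict is JSON-serializable, so the whole state can be saved and reloaded as a single document.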

Launch/Pause/Resume API:

from datetime import datetime, timezone


def now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


class WorkflowController:
    # create_initial_state() and execute_next_step() are defined elsewhere.

    def __init__(self, state_store):
        # Any object exposing save(workflow_id, state) and load(workflow_id).
        self.state_store = state_store

    def launch(self, workflow_id: str, initial_input: dict) -> str:
        """Start workflow, return workflow_id."""
        state = create_initial_state(workflow_id, initial_input)
        self.state_store.save(workflow_id, state)
        self.execute_next_step(workflow_id)
        return workflow_id

    def pause(self, workflow_id: str, reason: str) -> bool:
        """Pause workflow, preserving all state."""
        state = self.state_store.load(workflow_id)
        state["status"] = "paused"
        state["timestamps"]["paused_at"] = now()
        state["execution_state"]["pause_reason"] = reason
        self.state_store.save(workflow_id, state)
        return True

    def resume(self, workflow_id: str) -> bool:
        """Resume from exactly where we left off."""
        state = self.state_store.load(workflow_id)
        if state["status"] != "paused":
            return False  # only a paused workflow can be resumed
        state["status"] = "running"
        state["timestamps"]["resumed_at"] = now()
        self.state_store.save(workflow_id, state)
        self.execute_next_step(workflow_id)
        return True

See references/state-management.md for detailed implementation.

Factor 8: Own Your Control Flow

Principle: Don't let the LLM control the entire DAG. If you own control flow, you can Break, Switch, Summarize, Judge.

Control Flow Operations:

Operation	Purpose	When to Use
Break	Stop agent loop early	Error threshold, timeout, explicit stop signal
Switch	Route to different agent	Based on output classification
Summarize	Compress context	Approaching token limit
Judge	Evaluate quality	Before committing results

Anti-pattern: LLM-Controlled DAG

# BAD: LLM decides what to do next autonomously
response = llm.call("You have full control. What should we do next?")
next_action = parse_action(response)  # LLM controls flow

Pattern: Code-Controlled DAG

# GOOD: Code owns the control flow
def workflow_step(state: dict) -> dict:
    # 1. Execute current step
    result = execute_step(state["current_step"], state)

    # 2. Code decides next step (not LLM)
    if result["needs_human"]:
        return transition(state, "human_input")  # BREAK for human
    elif result["context_tokens"] > 6000:
        return transition(state, "summarize")    # SUMMARIZE
    elif result["quality_uncertain"]:
        return transition(state, "verify")       # JUDGE
    elif result["category"] == "complex":
        return transition(state, "specialist")   # SWITCH
    else:
        return transition(state, result["next"]) # CONTINUE

Smaller Focused Prompts Beat Long Autonomous Runs:

Instead of:
  "Do everything: plan, execute, verify, report"

Use:
  Step 1: "Create a plan" (focused prompt)
  [Code evaluates plan quality]
  Step 2: "Execute task X" (focused prompt)
  [Code checks result]
  Step 3: "Verify result" (focused prompt)
  [Code decides next step]
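
The step-by-step structure above might look like this in code. It is a sketch under assumptions: `run_focused_steps` is a hypothetical function, `llm` is any callable that maps a prompt string to a response string, and the quality checks are deliberately trivial.

```python
from typing import Callable


def run_focused_steps(request: str, llm: Callable[[str], str]) -> dict:
    """Three focused prompts, with plain code deciding what happens between them."""
    # Step 1: focused planning prompt
    plan = llm(f"Create a plan for: {request}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    if not steps:  # code evaluates plan quality
        return {"status": "failed", "reason": "empty plan"}

    # Step 2: one focused prompt per task
    results: list[str] = []
    for step in steps:
        result = llm(f"Execute task: {step}")
        if not result.strip():  # code checks each result
            return {"status": "failed", "reason": f"no output for step {step!r}"}
        results.append(result)

    # Step 3: focused verification prompt; code decides the next step
    verdict = llm("Verify result. Answer PASS or FAIL.\n" + "\n".join(results))
    passed = verdict.strip().upper().startswith("PASS")
    return {"status": "completed" if passed else "needs_replan", "results": results}
```

Because `llm` is injected, the same function runs against a stub in tests and against a real model client in production.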

Factor 12: Stateless Reducers

Principle: Agent logic as pure functions that reduce (state, event) → new_state. Enables replay, debugging, and reasoning about behavior.

Reducer Pattern:

from copy import deepcopy


def agent_reducer(state: dict, event: dict) -> dict:
    """
    Pure function: no side effects, deterministic output.

    Args:
        state: Current workflow state
        event: What just happened (user input, tool result, etc.)

    Returns:
        New state (never mutates input)
    """
    new_state = deepcopy(state)

    match event["type"]:  # match statement requires Python 3.10+
        case "USER_INPUT":
            new_state["business_state"]["messages"].append(event["data"])
            new_state["execution_state"]["next_step"] = "plan"

        case "PLAN_CREATED":
            new_state["business_state"]["plan"] = event["data"]
            new_state["execution_state"]["next_step"] = "execute"

        case "TASK_COMPLETED":
            # "results" is not in the initial schema, so create it on first use
            new_state["business_state"].setdefault("results", []).append(event["data"])
            # get_remaining_tasks is defined elsewhere and must itself be pure
            remaining = get_remaining_tasks(new_state)
            new_state["execution_state"]["next_step"] = "execute" if remaining else "verify"

        case "VERIFICATION_FAILED":
            new_state["execution_state"]["retry_config"]["attempts"] += 1
            new_state["execution_state"]["next_step"] = "replan"

    # Use the event's own timestamp: reading the clock here would make
    # the reducer non-deterministic and break replay.
    new_state["timestamps"]["last_updated"] = event["timestamp"]
    return new_state

Benefits of Reducer Pattern:

Replay: Feed same events, get same state
Debugging: Inspect state at any point
Testing: Pure functions are easy to test
Time travel: Roll back by replaying only a prefix of the event log

Event Log for Replay:

{
  "workflow_id": "uuid",
  "events": [
    {"seq": 1, "type": "USER_INPUT", "data": {...}, "timestamp": "..."},
    {"seq": 2, "type": "PLAN_CREATED", "data": {...}, "timestamp": "..."},
    {"seq": 3, "type": "TASK_COMPLETED", "data": {...}, "timestamp": "..."}
  ]
}

To replay (with reduce imported from functools): final_state = reduce(agent_reducer, events, initial_state)
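
A self-contained sketch of replay and time travel with `functools.reduce`. The `tiny_reducer` and `replay` functions and the simplified state shape are hypothetical, chosen to keep the example short.

```python
from copy import deepcopy
from functools import reduce
from typing import Optional


def tiny_reducer(state: dict, event: dict) -> dict:
    """Pure reducer over a simplified state: never mutates its input."""
    new_state = deepcopy(state)
    if event["type"] == "TASK_COMPLETED":
        new_state["completed"].append(event["data"])
    elif event["type"] == "VERIFICATION_FAILED":
        new_state["attempts"] += 1
    new_state["last_seq"] = event["seq"]
    return new_state


def replay(events: list[dict], initial: dict, upto_seq: Optional[int] = None) -> dict:
    """Fold events into state; pass upto_seq to stop at an earlier point."""
    selected = [e for e in events if upto_seq is None or e["seq"] <= upto_seq]
    return reduce(tiny_reducer, selected, initial)


initial = {"completed": [], "attempts": 0, "last_seq": 0}
events = [
    {"seq": 1, "type": "TASK_COMPLETED", "data": "t1"},
    {"seq": 2, "type": "VERIFICATION_FAILED", "data": None},
    {"seq": 3, "type": "TASK_COMPLETED", "data": "t2"},
]
final_state = replay(events, initial)
earlier_state = replay(events, initial, upto_seq=2)
```

Replaying the full log twice yields equal states, and `earlier_state` reconstructs the workflow as it stood after the second event without any stored snapshot.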

Additional Resources

Reference Files

For detailed implementation patterns:

references/event-driven-details.md - Complete event-driven implementation
references/state-management.md - State synchronization strategies (includes unified state, reducers)
../agent-specification/references/twelve-factor-agents.md - Quick reference for all 12 factors

Related Skills

agent-specification - Define each agent properly (Factors 1, 2, 4, 7)
production-readiness - Add observability and error handling (Factors 9, 11)
mas-decision-gate - Decide if multi-agent is needed (Factor 10)