name: autonomous-loop-patterns description: "Use when designing, reviewing, or debugging an autonomous AI agent loop: repeated agent execution, completion signals, checkpoints, supervisor respawn, stall detection, safety caps, and human handoff rules. Covers the core loop patterns from simple bounded runs through sentinel-based continuation, checkpoint-resume, and external supervisor loops. Do NOT use for choosing a specific agent product command (use agent-engineering or the product's docs), writing ordinary task instructions (use prompt-craft), or optimizing individual tool calls (use tool-call-strategy)." metadata: relations: "{"related":["prompt-craft","tool-call-strategy","context-management","agent-engineering","observability-modeling"],"verify_with":["observability-modeling","agent-engineering"]}" subject: agent-ops public: "true" scope: "Use when designing, reviewing, or debugging an autonomous AI agent loop: repeated agent execution, completion signals, checkpoints, supervisor respawn, stall detection, safety caps, and human handoff rules. Covers the core loop patterns from simple bounded runs through sentinel-based continuation, checkpoint-resume, and external supervisor loops. Do NOT use for choosing a specific agent product command (use agent-engineering or the product's docs), writing ordinary task instructions (use prompt-craft), or optimizing individual tool calls (use tool-call-strategy)." taxonomy_domain: agent/loop-design triggers: "["autonomous-loop-skill","loop-patterns-skill","agent-loop-design"]" keywords: "["autonomous agent loop","agent loop pattern","completion signal","checkpoint resume loop","supervisor respawn","stall detection","safety cap","agent watchdog","human handoff"]" stability: experimental mental_model: "An autonomous loop has six primitives: a trigger, a worker agent, a progress signal, a stop condition, durable state, and a safety cap. Different loop patterns place those primitives in different owners: a bounded run keeps them in the prompt and runtime limit, a sentinel loop keeps the stop condition in a completion marker, a checkpoint loop persists state between sessions, and a supervisor loop keeps restart and timeout policy outside the worker." purpose: "Autonomous loop patterns replace improvised keep-going instructions with explicit control design. They solve the failure mode where an agent keeps retrying without a stop rule, loses progress after a restart, or appears active while making no useful progress." concept_boundary: "This skill owns loop control shape, not the work performed inside each iteration. Use prompt-craft for the wording of a single worker prompt, tool-call-strategy for per-tool efficiency, agent-engineering for broader multi-agent system architecture, context-management for what context to load, and observability-modeling for telemetry schema design." analogy: "An autonomous loop is an autopilot mode: it can keep flying, but only because it has instruments, altitude limits, a route, and a clear handoff back to a pilot." misconception: "The common mistake is treating autonomy as permission to run forever. A safe loop is defined by when it stops, what state it writes, what evidence proves progress, and what cap forces human review." skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph" skill_graph_project: Skill Graph skill_graph_canonical_skill: skills/agent-ops/autonomous-loop-patterns/SKILL.md
Autonomous Loop Patterns
Concept of the skill
An autonomous loop has six primitives: a trigger, a worker agent, a progress signal, a stop condition, durable state, and a safety cap.
Coverage
- Core primitives of an autonomous agent loop: trigger, worker agent, progress signal, stop condition, durable state, and safety cap.
- Pattern selection across bounded single-run loops, sentinel continuation loops, checkpoint-resume loops, and external supervisor loops.
- Completion signal design: explicit done markers, tracker state, exit status, persisted status files, and observable progress evidence.
- Safety design: iteration limits, consecutive-error limits, elapsed-time limits, budget limits, context-health exits, and human handoff thresholds.
- Stall detection and recovery: heartbeat age, unchanged work state, repeated failures, repeated plan churn, and supervisor escalation.
- Checkpoint and handoff contracts: what state must persist between runs and what state must never live only in agent memory.
- Anti-patterns that make autonomous loops unsafe: unbounded retry, prompt-only reliability, hidden mutable state, and silent respawn storms.
Philosophy of the skill
An autonomous agent loop is not just an agent being told to continue. It is a control system. The agent is one component; the loop decides when to run it again, what evidence proves progress, what state survives a crash, and when a human must take over.
The smallest safe loop is usually better than the most powerful loop. A one-off task with a clear finish condition does not need a queue supervisor. A multi-session backlog should not rely on a single completion word. A long-running unattended process must not depend on the worker agent remembering its own state.
The quality bar is explicit termination plus recoverable state. If a loop cannot answer "why did this run again?", "what changed since the last iteration?", and "what stops it from running forever?", it is not an autonomous loop. It is an uncapped retry.
The Six Primitives
| Primitive | Question it answers | Examples |
|---|---|---|
| Trigger | What starts the next iteration? | User request, queue item, scheduler tick, failed verification |
| Worker agent | Who performs one unit of work? | A coding agent, reviewer agent, data extractor, browser runner |
| Progress signal | How do we know anything changed? | Commit, test result, status update, artifact write, metric delta |
| Stop condition | What means the loop is done? | Empty queue, completion marker, passing gate, explicit human stop |
| Durable state | What survives crash or context reset? | Checkpoint file, issue comment, job record, append-only log |
| Safety cap | What forces review when progress fails? | Max iterations, max consecutive errors, elapsed-time cap, budget cap |
Design the primitives first. Tooling choices come second.
Pattern Catalog
| Pattern | Best for | Stop owner | State owner | Main risk |
|---|---|---|---|---|
| Bounded single-run loop | One clear task that should finish in one session | Runtime limit or final verification gate | The current run plus final artifact | Agent tries to continue after the task is already done |
| Sentinel continuation loop | Small repeated task with a precise done marker | Completion marker checked by a wrapper or hook | Transcript plus optional counter | Completion marker appears accidentally or never appears |
| Checkpoint-resume loop | Multi-session work where context may reset | Checkpoint state and remaining-work count | Durable checkpoint | Stale checkpoint causes repeated or skipped work |
| Supervisor respawn loop | Long-running unattended throughput | External supervisor | Status files, queue, and logs | Respawn storm after repeated failure |
| Human-gated loop | Risky work with side effects or unclear requirements | Human approval gate | Review record and approved next action | Loop waits without making the escalation visible |
Pattern 1: Bounded Single-Run Loop
Use this for a single task with a concrete finish condition. The prompt names the deliverable, the runtime enforces a hard limit, and verification decides whether the run is done.
Use when:
- The work has one primary deliverable.
- The done condition can be verified inside one run.
- Failure is reviewable from the final artifact and logs.
- Restarting from scratch would not lose meaningful progress.
Required safeguards:
- A hard iteration or elapsed-time limit.
- A final verification command or acceptance gate.
- A clear final status: done, blocked, or failed.
Do not use this for long backlogs, stateful migrations, or tasks where partial progress must be resumed.
Pattern 2: Sentinel Continuation Loop
A sentinel loop repeats until the worker emits a precise completion marker, or until a wrapper decides the marker is absent and starts another turn. Some teams call this the Ralph Wiggum pattern: the runtime keeps going until the agent says the exact stop phrase.
Prompt contract:
Do the task described below.
Completion condition: all requested changes are implemented and verification passes.
When and only when the completion condition is true, output this exact marker:
TASK_COMPLETE_9F3A
If the task is blocked, output BLOCKED with the reason instead.
Do not output the completion marker in code, examples, logs, or explanations.
Use when:
- The task is small enough that repeated turns stay understandable.
- The completion condition is easy to state as a marker contract.
- A wrapper can count iterations and stop after a cap.
Required safeguards:
- Use an uncommon marker, not a word like "done".
- Count iterations outside the model.
- Stop on a blocked marker instead of continuing forever.
- Keep the marker out of code snippets and examples.
Do not use this when the task spans many sessions, requires durable queue state, or has high-risk side effects.
Pattern 3: Checkpoint-Resume Loop
A checkpoint loop persists the state needed to resume later. The worker writes a checkpoint at the end of each run. The next run reads it, verifies it against current reality, and continues.
Minimum checkpoint contract:
{
"objective": "short stable goal",
"iteration": 3,
"max_iterations": 10,
"remaining_work": ["item-a", "item-b"],
"completed_work": ["item-0"],
"last_verified_evidence": "test name or artifact reference",
"context_health": "ok | degraded | exhausted",
"next_action": "the first action for the next run",
"stop_reason": null
}
Use when:
- The work cannot safely fit in one context window.
- Partial progress must survive a restart.
- The loop must decide whether work remains before starting another run.
- A fresh run may need a compact handoff instead of full history.
Required safeguards:
- Write checkpoints atomically where the platform supports it.
- Treat checkpoint state as a cache; verify current reality before acting.
- Stop when context health is exhausted, even if work remains.
- Include the next action so the next run does not rediscover the plan.
Do not store the only copy of progress in model memory or chat history.
Pattern 4: Supervisor Respawn Loop
A supervisor loop runs outside the worker agent. It starts a worker, watches status and timeout signals, records the result, and decides whether to spawn another worker.
Use when:
- Many independent work items need unattended throughput.
- Each worker should start with fresh context.
- The supervisor can own queue selection, timeout, and retry policy.
- Workers may fail independently without ending the whole process.
Required safeguards:
- Per-worker timeout.
- Consecutive-error cap.
- Queue item lock or claim before work starts.
- Status write on success, failure, blocked, and timeout.
- Supervisor log that explains every respawn decision.
Do not let a supervisor respawn a worker after repeated identical failures without changing state, backoff, or escalation.
Completion Signals Ranked
| Signal | Reliability | Use it for | Failure mode |
|---|---|---|---|
| Authoritative tracker state | High | Queue and backlog loops | Tracker update omitted or duplicated |
| Passing verification gate | High | Coding, data, or document loops | Gate is too shallow or not rerun |
| Explicit sentinel marker | Medium-high | Small repeated tasks | Marker appears accidentally or never appears |
| Durable checkpoint says no work remains | Medium | Checkpoint loops | Checkpoint is stale |
| Process exit code | Medium | Supervisor loops | Exit code lacks semantic detail |
| Absence of new output | Low | Last-resort stall hint only | Quiet work and stalled work look the same |
Prefer authoritative tracker state and verification gates when available. Use sentinel markers for small loops. Use absence of output only as a stall warning, never as proof of completion.
Safety Caps
Every autonomous loop needs at least one cap. Unattended loops usually need several.
| Cap | Prevents | Typical default |
|---|---|---|
| Max iterations | Endless continue loops | 5-15 iterations, lower for risky work |
| Consecutive errors | Respawn storms | Stop after 3 repeated failures |
| Elapsed time | Long silent runs | Based on expected phase duration |
| Work item lock age | Zombie ownership | Expire only after evidence of worker death |
| Context health | Low-quality late-session changes | Stop and hand off at exhausted context |
| Budget | Runaway cost | Small initial budget, staged increase |
The cap must be enforced outside the worker when possible. A model instruction that says "do not loop forever" is not a cap.
Stall Detection
A loop is stalled when it keeps consuming iterations without improving the durable state.
Common stall signals:
- Same work item repeated across several iterations.
- No new durable artifact after an iteration that claimed progress.
- Same verification failure appears repeatedly.
- The worker rewrites the plan but does not execute it.
- Heartbeat or status timestamp is older than the expected phase duration.
- Supervisor respawns the same failing task without backoff or escalation.
Recovery sequence:
- Stop the current worker or refuse the next respawn.
- Preserve the latest checkpoint, logs, and verification output.
- Classify the stall: unclear requirement, failing dependency, repeated bug, or loop-control error.
- Escalate to human review when the next action requires judgment.
- Restart only after changing the state that caused the stall.
Do not recover from a stall by only increasing the iteration limit.
Pattern Selection
Use this decision table before implementing loop control.
| Situation | Recommended pattern |
|---|---|
| One task, one artifact, clear verification | Bounded single-run loop |
| One task that may need a few more turns | Sentinel continuation loop |
| Multiple related steps that may exceed context | Checkpoint-resume loop |
| Many independent queue items | Supervisor respawn loop |
| Side effects, approvals, or unclear requirements | Human-gated loop |
| Unknown done condition | Do not loop yet; define the stop condition first |
The rule of thumb: choose the simplest pattern that can stop safely and resume correctly.
Implementation Checklist
Before shipping an autonomous loop, answer these questions in writing:
- What exactly starts one iteration?
- What exactly proves the iteration made progress?
- What exactly means the whole loop is done?
- Where is state written so it survives restart?
- What cap stops repeated failure?
- How does a human see why the loop stopped?
- Which operation is safe to retry, and which requires approval?
If any answer is "the agent will remember", the design is not ready.
Anti-Patterns
| Anti-pattern | Why it fails | Safer replacement |
|---|---|---|
| Unbounded continue prompt | The model can retry forever without new evidence | External iteration cap plus blocked state |
| Prompt-only reliability | The model is both worker and watchdog | Runtime or supervisor enforces caps |
| Hidden progress in chat history | Restart loses the work state | Durable checkpoint or tracker update |
| Completion by silence | Quiet output is indistinguishable from a hang | Explicit done, blocked, failed, or timed-out state |
| Respawn without state change | Repeats the same failure | Backoff, classify, and escalate |
| One giant worker context | Context rot degrades decisions | Fresh worker per item or checkpoint handoff |
| Human gate buried in logs | Review never happens | Explicit approval state and visible stop reason |
Verification
After applying this skill, verify:
- The chosen pattern is the simplest one that can safely stop and resume.
- The loop has an explicit stop condition and an explicit blocked condition.
- At least one safety cap is enforced outside the worker agent.
- Durable state records objective, iteration, remaining work, latest evidence, and next action.
- Stall detection can identify repeated work, repeated failure, stale heartbeat, or no durable progress.
- Completion is based on tracker state, verification output, or a precise marker, not silence.
- Human handoff is visible when the loop reaches a cap or a judgment boundary.
Do NOT Use When
| Use instead | When |
|---|---|
agent-engineering |
Designing the full production agent system, including model routing, multi-agent coordination, and rollout policy |
prompt-craft |
Writing the exact instruction for one agent run or one worker prompt |
tool-call-strategy |
Optimizing how many tools one agent calls inside a single iteration |
context-management |
Deciding what information belongs in one worker's context |
observability-modeling |
Designing event names, spans, metrics, and trace attributes for the loop |
| Product-specific docs | Choosing a slash command, IDE feature, or hosted-agent setting in a particular tool |
Skill Graph context
Classification
- Subject:
agent-ops - Public:
true - Domain:
agent/loop-design - Scope: Use when designing, reviewing, or debugging an autonomous AI agent loop: repeated agent execution, completion signals, checkpoints, supervisor respawn, stall detection, safety caps, and human handoff rules. Covers the core loop patterns from simple bounded runs through sentinel-based continuation, checkpoint-resume, and external supervisor loops. Do NOT use for choosing a specific agent product command (use agent-engineering or the product's docs), writing ordinary task instructions (use prompt-craft), or optimizing individual tool calls (use tool-call-strategy).
When to use
- Triggers:
autonomous-loop-skill,loop-patterns-skill,agent-loop-design
Related skills
- Verify with:
observability-modeling,agent-engineering - Related:
prompt-craft,tool-call-strategy,context-management,agent-engineering,observability-modeling
Concept
- Mental model: An autonomous loop has six primitives: a trigger, a worker agent, a progress signal, a stop condition, durable state, and a safety cap. Different loop patterns place those primitives in different owners: a bounded run keeps them in the prompt and runtime limit, a sentinel loop keeps the stop condition in a completion marker, a checkpoint loop persists state between sessions, and a supervisor loop keeps restart and timeout policy outside the worker.
- Purpose: Autonomous loop patterns replace improvised keep-going instructions with explicit control design. They solve the failure mode where an agent keeps retrying without a stop rule, loses progress after a restart, or appears active while making no useful progress.
- Boundary: This skill owns loop control shape, not the work performed inside each iteration. Use prompt-craft for the wording of a single worker prompt, tool-call-strategy for per-tool efficiency, agent-engineering for broader multi-agent system architecture, context-management for what context to load, and observability-modeling for telemetry schema design.
- Analogy: An autonomous loop is an autopilot mode: it can keep flying, but only because it has instruments, altitude limits, a route, and a clear handoff back to a pilot.
- Common misconception: The common mistake is treating autonomy as permission to run forever. A safe loop is defined by when it stops, what state it writes, what evidence proves progress, and what cap forces human review.
Keywords
autonomous agent loop,agent loop pattern,completion signal,checkpoint resume loop,supervisor respawn,stall detection,safety cap,agent watchdog,human handoff