name: continuity-loop description: >- Own the unattended renewal spine: renewal ticks, the two-tick stall rule, escalation for NTM panes over MCP Agent Mail. Use when wiring or tuning a loop's continuity step. practices:
- sre
- team-topologies
- design-by-contract hexagonal_role: supporting consumes:
- ntm
- agent-mail produces:
- .agents/continuity/state.json
- escalation-message context_rel:
- kind: supplier-to with: evolve
- kind: supplier-to with: using-atm
- kind: supplier-to with: recover
- kind: customer-of with: agent-mail
- kind: customer-of
with: ntm
skill_api_version: 1
user-invocable: false
context:
window: fork
intent:
mode: task
sections:
exclude:
- HISTORY intel_scope: topic metadata: tier: orchestration stability: experimental dependencies:
- agent-mail
- ntm output_contract: .agents/continuity/state.json (renewed per tick) + Agent Mail escalation message on stall
/continuity-loop — The Unattended Renewal Spine
This skill owns the renewal contract behind the live multi-agent substrate the repo CLAUDE.md declares: NTM panes + MCP Agent Mail + renewal ticks. It does not run agents and it does not schedule anything — it defines what a tick is, what every supervised lane must do per tick, how a stall is detected, and when a stall becomes an escalation. Loop skills compose it; they do not reimplement it.
Use when: a loop skill needs a continuity step, a lane looks dead and you need a verdict (suspect / stalled / converged), or you are deciding whether to nudge, relaunch, or escalate.
⚠️ Critical Constraints
- This skill is a contract, not a scheduler — because AgentOps 3.0 ships no daemon. Tick firing is owned by host timing (cron, systemd user timer, or the operator tending loop); a substrate that owns its own clock drifts from the hookless doctrine.
.agents/continuity/state.jsonis the only continuity state surface, because two state surfaces guarantee a split-brain stall verdict; every consumer reads and renews the same file.- Two ticks before any stall verdict — because a single tick cannot distinguish a slow tool call from a wedge. One missed renewal marks a lane SUSPECT, never STALLED; acting on one tick produces nudge-storms that kill healthy lanes.
- Escalation is a message, never a silent kill, because the tender and the operator must see the same evidence trail; a killed pane with no Agent Mail record is indistinguishable from a crash.
- Never route continuity through
rpiorbd— retired legacy — to prevent reviving a SPOF the substrate already paid to remove: the in-session ao-rpi loop is retired as the live workflow and bd/Dolt was retired 2026-06-11 (see Retired legacy). - Healthy lanes are left alone, because interruptions reset agent context — a lane renewing on cadence gets no nudge, no message, no restart; the intervention IS the failure mode when the lane was fine.
What a renewal tick is
A renewal tick is one bounded observation pass over the supervised lanes: read each lane's pane activity (NTM robot state) and Agent Mail traffic, decide per lane whether forward progress happened since the last tick, and renew that lane's entry in the state surface. A tick observes and records; intervention is a separate decision the tick's output feeds.
Forward progress evidence (any one suffices):
| Signal | Source |
|---|---|
| New pane output delta since last tick | NTM pane/robot state |
| New Agent Mail message or reservation activity | am inbox / reservations |
Lane self-renewed its state.json entry |
the state surface |
| Work-product delta (commit, closed item, new artifact) | git / tracker / .agents/ |
Tick cadence and the renewal contract
- Default cadence: one tick per 10 minutes of unattended operation; tighten
for short-lived swarms, loosen for overnight runs. Cadence lives in the host
timer, and is recorded in
state.jsonso consumers can compute staleness. - Renewal contract: every supervised lane gets its
last_renewalandtick_seqadvanced each tick — either self-renewed by the lane or renewed by the tender on observed evidence. An entry whosetick_seqis behind the global tick counter by one is SUSPECT; behind by two or more is STALLED. - A lane that finishes cleanly is marked
convergedand leaves supervision; it is never reported as stalled.
Stall detection — the two-tick rule
- Tick N: no forward-progress evidence → mark lane
suspect. No action. - Tick N+1: still no evidence → mark lane
stalled. Now intervene, in order: one nudge → if the next tick shows no recovery, relaunch the lane and route it through recover → if the relaunched lane stalls again on the same work item, escalate. - Any forward-progress evidence at any point resets the lane to
active.
State surface
Path: .agents/continuity/state.json — single file, renewed in place per
tick (write temp + rename).
{
"schema": "continuity-state.v1",
"tick_seq": 42,
"cadence_minutes": 10,
"last_tick": "2026-06-12T14:30:00Z",
"lanes": [
{
"lane": "ntm:agentops:pane-3",
"agent": "claude-worker-3",
"work_item": "ag-xxxx",
"status": "active",
"tick_seq": 42,
"last_renewal": "2026-06-12T14:30:00Z",
"evidence": "am message 2026-06-12T14:28Z; pane delta +214 lines"
}
]
}
status ∈ active | suspect | stalled | converged | escalated. Every status
change cites its evidence string — a bare status flip is invalid.
How loop skills route their continuity step here
- evolve — its halt/handoff machinery (non-sticky
HANDOFF, context exhaustion is not a stop) maps onto the tick contract: a
handoff written this tick is forward-progress evidence, and the next tick
expects the resumed lane to renew. Evolve consults
state.jsoninstead of inventing its own liveness probe. - using-atm — the ATM tending loop is the reference tick executor: each tending pass over the panes is one renewal tick, and its nudge/relaunch/re-dispatch triggers are the intervention ladder this skill sequences. Pane-truth beats roster state when they disagree.
- recover — the re-entry adapter. A relaunched lane
runs recover to rebuild context, then renews its
state.jsonentry; that renewal closes the stall.
Escalation criteria
Escalate (Agent Mail message to the operator/tender lane + lane status
escalated) when any of these hold:
- Two-tick stall that survived one nudge and one relaunch.
- The same work item stalls two different lanes (the work is poisoned, not the lane).
- Auth or rate-limit failure that account rotation did not clear.
- A file-reservation conflict on the lane's write surface (route to agent-mail coordination, not to a retry).
- A lane is re-doing work its own
evidencetrail shows complete (loop without progress — context saturation).
Retired legacy (explicitly NOT live)
- The in-session ao-rpi loop — retired as the live workflow. Its run directories are not a continuity surface; do not write renewal state there.
- bd/Dolt tracker liveness — retired 2026-06-11. Tracker reachability is
not a lane-liveness signal and
state.jsonnever syncs into it.
Both are listed here only so nobody routes a continuity step back into them.
Output Specification
Format: JSON file (continuity-state.v1, shown above) renewed atomically
each tick, plus an Agent Mail escalation message when a lane reaches escalated.
Path: .agents/continuity/state.json, written to in place (temp + rename);
escalations go to the operator/tender lane via am.
Exit signal: a tick reports lanes: N active / N suspect / N stalled / N converged to the tender.
Quality Rubric
- No stall verdict was issued on a single missed tick (two-tick rule held)
- Every status change in
state.jsoncarries an evidence string - Escalations exist as Agent Mail messages, not just status flips
- No continuity state was written outside
.agents/continuity/state.json - No live-path reference to rpi or bd anywhere in the continuity step
See Also
- using-atm — the reference tick executor (ATM tending loop)
- vibing-with-ntm — pane-level liveness truth and unsticking
- agent-mail — the message + reservation leg escalations ride on
- evolve, recover — loop consumers of this contract
- handoff — what a context-saturated lane writes before relaunch