name: escalation-ladder
description: Time-graded recovery escalation for the reconciliation motor (Retry → RefreshSource → ReloadGems → RecycleWorkers → PauseAndAlert) as durations grow without reaching Ready. Use when the user describes a recurring failure the motor can't recover from, when asked how the controller should respond when X persists for N minutes, or when designing recovery policy for a new bug class. Fix-axis sibling of controller-detection-axis; codified in pangea-operator/src/controller/escalation.rs.
allowed-tools: Read, Write, Edit, Glob, Grep, Bash
metadata:
  version: "0.1.0"
  domain_keywords:
    - "escalation"
    - "recovery"
    - "stuck"
    - "settling"
    - "anomaly"
    - "controller"
    - "reconciliation"
    - "recovery-policy"
    - "pause-and-alert"
    - "pangea-operator"
escalation-ladder — Time-graded recovery
The reconciliation motor needs progressively deeper corrective actions when a template can't reach Ready. This skill names the pattern, the production-default ladder, and the wire-in shape so any new failure surface gets recovery semantics by default.
The default ladder (pangea-operator)
| Rung | After | Action | Handles |
|---|---|---|---|
| 0 | 0s | Retry | normal reconcile path |
| 1 | 5 min | RefreshSource | "source moved, our clone is stale" |
| 2 | 15 min | ReloadGems | "in-process Ruby state is wedged; FS is fine" |
| 3 | 30 min | RecycleWorkers | "the CRuby VM is irrecoverable; kill pool" |
| 4 | 60 min | PauseAndAlert | "we tried everything; human required" |
Each action is idempotent. Each label is stable (locked by test). depth() gives each action an ordinal (0 = shallowest, 4 = deepest) for comparisons and dashboards.
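To make the label and depth contract concrete, here is a minimal sketch of the action enum. The variant names and the 0 to 4 depths come from the table; the exact label strings are an assumption of this sketch (the real ones are whatever the test in escalation.rs locks).

```rust
/// Recovery actions, declared shallowest to deepest.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EscalationAction {
    Retry,
    RefreshSource,
    ReloadGems,
    RecycleWorkers,
    PauseAndAlert,
}

impl EscalationAction {
    /// Stable identifier for logs and status text.
    /// ASSUMPTION: these kebab-case strings are illustrative only.
    pub fn label(&self) -> &'static str {
        match self {
            Self::Retry => "retry",
            Self::RefreshSource => "refresh-source",
            Self::ReloadGems => "reload-gems",
            Self::RecycleWorkers => "recycle-workers",
            Self::PauseAndAlert => "pause-and-alert",
        }
    }

    /// Rung index from the default ladder: 0 is shallowest, 4 is deepest.
    pub fn depth(&self) -> u8 {
        match self {
            Self::Retry => 0,
            Self::RefreshSource => 1,
            Self::ReloadGems => 2,
            Self::RecycleWorkers => 3,
            Self::PauseAndAlert => 4,
        }
    }
}
```

Because depth() is a plain integer, a dashboard or alert rule can compare recommendations (for example, "depth >= 3") without knowing the variant names.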
When to invoke
| Situation | Apply? | Why |
|---|---|---|
| A new bug class with recurring failures | Yes | Surface the right rung from day 1; future handlers slot in. |
| A new template type / controller arm | Yes | Wire the ladder into its failure path so recovery is automatic. |
| "How should the controller respond to X persistent for N minutes?" | Yes | Map N to the right rung. |
| A one-shot bug fix | No | The ladder is overhead — fix it and move on. |
| Errors fixed by configuration retry alone | No | settlingPolicy + retryPolicy handle that; the ladder is for time-graded depth. |
Two classes covered
- Known knowns: typed Conflict from a detector named the bug class. Ladder picks action proportional to persistence.
- Known unknowns / unknown unknowns: controller saw an error it can't classify. Ladder still applies because the gate is TIME, not error shape. Rung 4 forces human attention before infinite cycle waste.
Codified API
pangea-operator/src/controller/escalation.rs:
pub enum EscalationAction { Retry, RefreshSource, ReloadGems, RecycleWorkers, PauseAndAlert }
impl EscalationAction { fn label(&self) -> &'static str; fn depth(&self) -> u8; }
pub struct EscalationRung { pub min_duration_unready: Duration, pub action: EscalationAction }
pub struct EscalationLadder { /* sorted Vec<EscalationRung> */ }
impl EscalationLadder {
    pub fn pangea_default() -> Self;
    pub fn from_rungs(rungs: Vec<EscalationRung>) -> Self; // sort-on-construct
    pub fn pick(&self, duration_unready: Duration) -> EscalationAction;
    pub fn rungs(&self) -> &[EscalationRung];
}
7 unit tests pass. PURE — no async, no I/O, no global state.
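The signatures above leave the selection rule implicit. The following is a sketch of one way the ladder could be implemented, not the code in escalation.rs. It assumes that pick returns the deepest rung whose threshold is less than or equal to the unready duration (an inclusive boundary), and that it falls back to Retry when no rung qualifies.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EscalationAction { Retry, RefreshSource, ReloadGems, RecycleWorkers, PauseAndAlert }

#[derive(Debug, Clone, Copy)]
pub struct EscalationRung {
    pub min_duration_unready: Duration,
    pub action: EscalationAction,
}

pub struct EscalationLadder {
    rungs: Vec<EscalationRung>, // kept ascending by min_duration_unready
}

impl EscalationLadder {
    /// Sorts on construction so callers may pass rungs in any order.
    pub fn from_rungs(mut rungs: Vec<EscalationRung>) -> Self {
        rungs.sort_by_key(|r| r.min_duration_unready);
        Self { rungs }
    }

    /// Timings from the default ladder table (0, 5, 15, 30, 60 minutes).
    pub fn pangea_default() -> Self {
        use EscalationAction::*;
        let rung = |mins: u64, action| EscalationRung {
            min_duration_unready: Duration::from_secs(mins * 60),
            action,
        };
        Self::from_rungs(vec![
            rung(0, Retry),
            rung(5, RefreshSource),
            rung(15, ReloadGems),
            rung(30, RecycleWorkers),
            rung(60, PauseAndAlert),
        ])
    }

    /// Deepest rung whose threshold has been reached.
    /// ASSUMPTION: inclusive boundary, and Retry when no rung qualifies
    /// (possible only for a custom ladder without a 0s rung).
    pub fn pick(&self, duration_unready: Duration) -> EscalationAction {
        self.rungs
            .iter()
            .rev()
            .find(|r| r.min_duration_unready <= duration_unready)
            .map(|r| r.action)
            .unwrap_or(EscalationAction::Retry)
    }

    pub fn rungs(&self) -> &[EscalationRung] {
        &self.rungs
    }
}
```

Under these assumptions, a template that has been unready for 4 min 59 s still gets Retry, and one at exactly 5 min gets RefreshSource. The function takes only a Duration, which is what keeps it pure and testable without a clock.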
Wire-in recipe
Call from any controller arm that handles a failure. The minimum useful wire is surface-only (log + status), valuable immediately even before action handlers ship:
let now = chrono::Utc::now();
let duration_unready = template.status.as_ref()
    .and_then(|s| s.phase_entered_at.as_ref())
    .map(|t| (now - *t).to_std().unwrap_or(Duration::ZERO))
    .unwrap_or(Duration::ZERO);
let action = EscalationLadder::pangea_default().pick(duration_unready);
tracing::info!(
    template = %name,
    duration_unready_s = duration_unready.as_secs(),
    recommended_action = action.label(),
    depth = action.depth(),
    "escalation ladder recommendation"
);
// Bake into lastError / Event message:
let msg = format!(
    "{} (recovery ladder recommends '{}' at depth {}, {}s unready)",
    original_msg, action.label(), action.depth(), duration_unready.as_secs(),
);
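The two unwrap_or(Duration::ZERO) calls in the snippet above cover different cases: a status that has no phase timestamp yet, and a timestamp that lies in the future (clock skew), which would give a negative elapsed time. The sketch below isolates that clamping as a pure helper. It swaps chrono for std::time::SystemTime to stay dependency-free, and the function names duration_unready and annotate are hypothetical.

```rust
use std::time::{Duration, SystemTime};

/// Time spent in the current non-Ready phase, clamped to zero.
/// `phase_entered_at` is None when status has no timestamp yet.
pub fn duration_unready(phase_entered_at: Option<SystemTime>, now: SystemTime) -> Duration {
    phase_entered_at
        // duration_since returns Err when `entered` is later than `now`.
        .map(|entered| now.duration_since(entered).unwrap_or(Duration::ZERO))
        .unwrap_or(Duration::ZERO)
}

/// Appends the recommendation to the original error text, using the same
/// message shape as the wire-in snippet.
pub fn annotate(original_msg: &str, label: &str, depth: u8, unready: Duration) -> String {
    format!(
        "{} (recovery ladder recommends '{}' at depth {}, {}s unready)",
        original_msg,
        label,
        depth,
        unready.as_secs(),
    )
}
```

Both degenerate cases collapse to zero seconds, so the ladder recommends its shallowest rung rather than escalating on bad data.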
Then a slice-5 follow-up wires the action handlers:
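match action {
    // Path-qualified so the arms match variants even without `use EscalationAction::*`.
    EscalationAction::Retry => { /* no extra */ }
    EscalationAction::RefreshSource => invalidate_workspace_cache(&template).await?,
    EscalationAction::ReloadGems => state.compiler_backend.reload_all_gems().await?,
    EscalationAction::RecycleWorkers => state.ruby_pool.recycle_all().await?,
    EscalationAction::PauseAndAlert => set_autosuspended_with_event(&template, &state).await?,
}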
Each handler is its own primitive — add one variant at a time. The ladder doesn't block on handlers being present.
Anti-patterns to flag
| Anti-pattern | Why bad | Right move |
|---|---|---|
| Hard-coding the actions in the controller arm | Doesn't compose; new arms duplicate the logic | Use the EscalationLadder primitive everywhere |
| Skipping pause_and_alert because "we should always retry" | Burns cycles forever on unrecoverable conditions | The deepest rung exists exactly for unknown-unknowns |
| Using cycle-count instead of duration | Doesn't honor "long enough to act" semantics | Duration::from_secs(...) is the gate; cycle-count is the orthogonal settlingPolicy signal |
| Making the action non-idempotent | Hazardous if rung fires twice across restarts | Every action MUST be idempotent (test it) |
| Surfacing the action only in logs (no status) | Operators can't see it via kubectl | Bake into status.lastError text + emit Event |
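The idempotency row is the easiest one to get wrong, so here is a minimal sketch of what "idempotent" could mean for the deepest rung. TemplateStatus, its fields, and pause_and_alert are hypothetical stand-ins, not the operator's real types: the point is only that a second firing (for example after a controller restart) changes nothing and emits no duplicate Event.

```rust
/// Hypothetical in-memory stand-in for the template's status.
#[derive(Debug, Default)]
pub struct TemplateStatus {
    pub autosuspended: bool,
    pub events: Vec<String>,
}

/// Idempotent pause: suspends and records one event on the first call,
/// and is a no-op on every later call. Returns true only when state changed.
pub fn pause_and_alert(status: &mut TemplateStatus, reason: &str) -> bool {
    if status.autosuspended {
        return false; // already paused; do not emit a second event
    }
    status.autosuspended = true;
    status.events.push(format!("AutoSuspended: {}", reason));
    true
}
```

A test for this property calls the action twice and asserts that the observable state after the second call equals the state after the first.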
Per-CR override (future / slice 4)
spec.recoveryPolicy.rungs[] lets a template override the default ladder. Production-aggressive workspaces can shorten the timings; production-tolerant ones can lengthen them. from_rungs(...) sorts on construction, so the order of rungs in the CR doesn't matter.
spec:
  recoveryPolicy:
    rungs:
      - afterSeconds: 60
        action: RefreshSource
      - afterSeconds: 600
        action: PauseAndAlert
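Since the override is still future work, the conversion from CR fields to rungs is not defined anywhere yet. The sketch below shows one plausible shape under stated assumptions: RungSpec and rungs_from_spec are hypothetical names, action strings are matched against the variant names used in the YAML, and unknown actions are rejected rather than ignored.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EscalationAction { Retry, RefreshSource, ReloadGems, RecycleWorkers, PauseAndAlert }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EscalationRung {
    pub min_duration_unready: Duration,
    pub action: EscalationAction,
}

/// Hypothetical Rust mirror of one spec.recoveryPolicy.rungs[] entry.
pub struct RungSpec {
    pub after_seconds: u64,
    pub action: String,
}

/// Converts CR entries into rungs, failing on the first unknown action name.
pub fn rungs_from_spec(specs: &[RungSpec]) -> Result<Vec<EscalationRung>, String> {
    let mut rungs = specs
        .iter()
        .map(|s| {
            let action = match s.action.as_str() {
                "Retry" => EscalationAction::Retry,
                "RefreshSource" => EscalationAction::RefreshSource,
                "ReloadGems" => EscalationAction::ReloadGems,
                "RecycleWorkers" => EscalationAction::RecycleWorkers,
                "PauseAndAlert" => EscalationAction::PauseAndAlert,
                other => return Err(format!("unknown recovery action: {}", other)),
            };
            Ok(EscalationRung {
                min_duration_unready: Duration::from_secs(s.after_seconds),
                action,
            })
        })
        .collect::<Result<Vec<_>, String>>()?;
    // Mirrors the sort that from_rungs is described as doing, so the
    // order of entries in the CR does not affect the result.
    rungs.sort_by_key(|r| r.min_duration_unready);
    Ok(rungs)
}
```

Rejecting unknown names at conversion time means a typo in a CR surfaces as a validation error instead of a silently shorter ladder.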
Composes with
- controller-detection-axis skill: the detection axis NAMES the anomaly via ConflictDetector; this skill TAKES ACTION over time. Same Conflict shape, different axis.
- settling.rs: provides stuck signals (cycle-count + fingerprint). Orthogonal to time-graded depth.
- error_policy.rs: categorizes errors. Pre-step to the ladder.
Workflow when invoking
- Smell-check: does the failure recur, with no scalar fix? If yes, ladder applies.
- Map persistence to depth: how long should X persist before each rung? Use the default unless the user names different timings.
- Pick the wire-in arm: which handle_<X>_failure function adds the ladder call?
- Surface-first: log + status + Event. Handlers later.
- Pure tests: TempDir-free; just Duration::from_secs(...) inputs and EscalationAction assertions.
Related memories
- memory/project_escalation_ladder.md: durable knowledge.
- memory/project_controller_detection_axis.md: the detect sibling axis.
- memory/project_operator_observability_backlog.md: slice-4 status field consumer.
Triggers
Invoke when:
- User describes a recurring failure the motor can't recover from.
- User asks "how should the controller respond when X persists for N minutes".
- Adding a new failure-handling arm to a controller.
- Designing recovery policy for a new bug class.
- A template is stuck at a non-Ready phase with high consecutive failure counts.
DO NOT invoke for:
- One-shot bug fixes.
- Bugs already handled by settlingPolicy + retryPolicy alone.
- Bugs where the structural fix is obvious + immediate (just fix it).