error-recovery - SKILL.md Agent Skill

name: "error-recovery" description: "Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure." domain: "reliability, agent-coordination" confidence: "high" license: MIT

Error Recovery Patterns

Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure.

1. Retry with Backoff

When: Transient failures — API timeouts, rate limits, network errors, temporary service unavailability.

Pattern:

Wait briefly, then retry (start at 2s, double each attempt)
Maximum 3 retries before escalating
Log each attempt with the error received

Example: API call returns 429 Too Many Requests → wait 2s → retry → wait 4s → retry → wait 8s → retry → escalate if still failing.

2. Fallback Alternatives

When: Primary tool or approach fails and an alternative exists.

Pattern:

Attempt primary approach
On failure, identify alternative tool/method
Try the alternative with the same intent
Document which alternative was used and why

Example: Primary CLI tool fails → fall back to direct API call for the same operation.

3. Diagnose-and-Fix

When: Build failures, test failures, linting errors — structured errors with actionable output.

Pattern:

Read the full error output carefully
Identify the root cause from error messages
Attempt a targeted fix
Re-run to verify the fix
Maximum 3 fix-retry cycles before escalating

Example: Build fails with a type error → check for missing import → add it → rebuild.

4. Escalate with Context

When: Recovery attempts have been exhausted, or the failure requires human judgment.

Pattern:

Summarize what was attempted and what failed
Include the exact error messages
State what you believe the root cause is
Suggest next steps or who might be able to help
Hand off to the coordinator or the appropriate specialist

Example: After 3 failed build attempts → "Build fails on line 42 with null reference. Tried X, Y, Z. Likely a design issue in the Foo module. Recommend the code owner review."

5. Graceful Degradation

When: A non-critical step fails but the overall task can still deliver value.

Pattern:

Determine if the failed step is critical to the task outcome
If non-critical, log the failure and continue
Deliver partial results with a clear note of what was skipped
Offer to retry the skipped step separately

Example: Generating a report with 5 sections — section 3 data source is unavailable → produce the report with 4 sections, note that section 3 was skipped and why.

Applying These Patterns

Each agent should reference these patterns in their charter's ## Error Recovery section, tailored to their domain. The charter should list the agent's most common failure modes and map each to the appropriate pattern above.

Selection guide:

Failure Type	Primary Pattern	Fallback Pattern
Network/API transient	Retry with Backoff	Escalate with Context
Tool/dependency missing	Fallback Alternatives	Escalate with Context
Build/test error	Diagnose-and-Fix	Escalate with Context
Auth/permissions	Retry with Backoff	Escalate with Context
Non-critical data missing	Graceful Degradation	—
Unknown/novel error	Escalate with Context	—