name: handle-errors
source: botcore
description: >
  Provides error handling patterns and recovery strategies for TypeScript and Python applications. Covers custom error hierarchies, Result/Either types (neverthrow, Effect), discriminated union errors, retry with exponential backoff and circuit breakers, RFC 9457 structured API error responses, exception groups, error observability, and agent-safe error recovery with graceful degradation. Use when designing error handling strategies, implementing Result types, adding retry logic, building error hierarchies, creating structured API errors, or making agent workflows resilient. Triggers: error handling, Result type, Either, neverthrow, Effect, retry, circuit breaker, backoff, error hierarchy, RFC 9457, error recovery, exception, try-catch, error boundary, structured errors, error logging.
version: 1.0.0
triggers:
  - error handling
  - Result type
  - Either type
  - neverthrow
  - Effect TS
  - retry logic
  - circuit breaker
  - exponential backoff
  - error hierarchy
  - custom error class
  - RFC 9457
  - Problem Details
  - error recovery
  - exception handling
  - try-catch
  - error boundary
  - structured errors
  - error logging
  - error tracking
  - tenacity
  - graceful degradation
  - fault tolerance
portable: true
Handling Errors
Expert guidance for designing and implementing error handling, recovery, and reporting strategies across TypeScript and Python applications, APIs, and AI agent workflows.
Capabilities
- Custom Error Hierarchies -- Domain-specific error classes with context, codes, and metadata in TypeScript and Python
- Result/Either Types -- Type-safe error handling with neverthrow, Effect, and discriminated unions that make errors explicit in the type system
- Retry and Circuit Breaker Patterns -- Exponential backoff with jitter, circuit breakers, bulkheads, and resilience strategies for transient failures
- Structured API Errors -- RFC 9457 Problem Details, error code registries, and machine-readable error responses
- Exception Groups and Aggregation -- Python ExceptionGroup patterns, error accumulation, and multi-error reporting
- Error Observability -- Structured error logging, error tracking integration (Sentry), error budgets, and alerting strategies
- Agent-Safe Error Recovery -- Graceful degradation, fallback chains, compensation patterns, and error handling for agentic workflows
Routing Logic
| Request type | Load reference |
|---|---|
| Custom error classes, error hierarchies, domain errors, error context | references/error-hierarchies.md |
| Result type, Either, neverthrow, Effect, discriminated unions, typed errors | references/result-types.md |
| Retry, backoff, jitter, circuit breaker, bulkhead, timeout, resilience | references/retry-and-resilience.md |
| RFC 9457, Problem Details, API errors, error codes, error catalog | references/api-error-design.md |
| Error logging, Sentry, observability, error budgets, tracking, alerts | references/error-observability.md |
| Agent error recovery, fallback, compensation, graceful degradation | references/agent-error-recovery.md |
Core Principles
1. Errors Are Values, Not Surprises
Make errors explicit in the type system. Functions that can fail should declare it in their signature, not hide failures in thrown exceptions.
// Bad: Caller has no idea this can fail
function getUser(id: string): User { ... }
// Good: Failure is explicit in the return type
function getUser(id: string): Result<User, NotFoundError | DatabaseError> { ... }

# Bad: Undeclared exception
def get_user(user_id: str) -> User: ...
# Good: Explicit result with type hint
def get_user(user_id: str) -> Result[User, NotFoundError | DatabaseError]: ...
Reserve thrown exceptions for truly unexpected situations (bugs, invariant violations). Handle expected failures through return values.
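As a minimal sketch of the Python side, assuming a hand-rolled Result type rather than any particular library, the pattern might look like the following. The `Ok`, `Err`, `_USERS`, and `greet` names are illustrative, not part of an existing API.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err(Generic[E]):
    error: E


# A Result is either a success value or an error value.
Result = Union[Ok[T], Err[E]]


@dataclass(frozen=True)
class NotFoundError:
    user_id: str


# Hypothetical in-memory store standing in for a database.
_USERS = {"u1": "Ada"}


def get_user(user_id: str) -> Result[str, NotFoundError]:
    """Return the user's name, or an explicit NotFoundError value."""
    name = _USERS.get(user_id)
    if name is None:
        return Err(NotFoundError(user_id))
    return Ok(name)


def greet(user_id: str) -> str:
    result = get_user(user_id)
    # The caller has to branch on the variant before touching the value.
    if isinstance(result, Ok):
        return f"Hello, {result.value}"
    return f"No user with id {result.error.user_id}"
```

The expected "not found" case never raises; the caller sees it in the signature and handles it as ordinary data.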
2. Classify Every Error
Every error in the system must be classified along two axes:
- Severity: Is this a warning, a recoverable error, or a fatal defect?
- Recoverability: Can the system retry or fall back, or must it abort?
| Category | Retryable | Action | Examples |
|---|---|---|---|
| Transient | Yes | Retry with backoff | Network timeout, 503, rate limit |
| Correctable | Sometimes | Fix input and retry | Validation error, auth expired |
| Permanent | No | Fail fast, report | Not found, forbidden, schema mismatch |
| Fatal | No | Abort, alert | Out of memory, config missing, invariant broken |
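One way to encode the table for HTTP failures is a small classifier. This is a sketch only: the `Category` enum and `classify_http_status` function are hypothetical names, and the status-to-category mapping is an assumption that a real service may need to adjust.

```python
from enum import Enum


class Category(Enum):
    TRANSIENT = "transient"
    CORRECTABLE = "correctable"
    PERMANENT = "permanent"
    FATAL = "fatal"  # not reachable from an HTTP status; raised by startup checks


def classify_http_status(status: int) -> Category:
    """Map an HTTP error status to a recoverability category (assumed mapping)."""
    if status < 400:
        raise ValueError(f"{status} is not an error status")
    if status in (429, 500, 502, 503, 504):
        return Category.TRANSIENT
    if status in (400, 401, 422):
        # The caller can fix the input or refresh credentials, then try again.
        return Category.CORRECTABLE
    return Category.PERMANENT


def should_auto_retry(category: Category) -> bool:
    # Only transient failures are retried without changing the request.
    return category is Category.TRANSIENT
```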
3. Provide Context, Not Just Messages
Every error must carry enough context to be debugged on its own, without piecing the story together from other log lines:
// Bad
throw new Error('Database query failed');
// Good
throw new DatabaseError('Failed to fetch user by ID', {
  code: 'DB_QUERY_FAILED',
  cause: originalError,
  context: { userId, query: 'SELECT * FROM users WHERE id = ?', latencyMs: 3200 },
  retryable: true,
});
Include: error code, original cause, operation context, whether it is retryable, and a trace/correlation ID.
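A Python counterpart could look roughly like this. The `AppError` base class and its field names are assumptions chosen to mirror the list above, not an existing library API.

```python
from typing import Any, Dict, Optional


class AppError(Exception):
    """Hypothetical base error carrying code, context, cause, and retryable flag."""

    def __init__(
        self,
        message: str,
        *,
        code: str,
        context: Optional[Dict[str, Any]] = None,
        retryable: bool = False,
        trace_id: Optional[str] = None,
        cause: Optional[BaseException] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.code = code
        self.context = dict(context or {})
        self.retryable = retryable
        self.trace_id = trace_id
        self.__cause__ = cause  # keeps the original error in the traceback chain

    def to_dict(self) -> Dict[str, Any]:
        """Serializable form for logs; strip internals before sending to users."""
        return {
            "code": self.code,
            "message": self.message,
            "context": self.context,
            "retryable": self.retryable,
            "trace_id": self.trace_id,
            "cause": repr(self.__cause__) if self.__cause__ else None,
        }


class DatabaseError(AppError):
    """Domain-specific subclass; add category-specific fields as needed."""
```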
4. Fail Fast on Non-Recoverable Errors
Do not retry or catch errors that cannot be resolved:
- Missing configuration at startup -- crash immediately with a clear message
- Schema validation failures -- reject and return structured error
- Authorization failures -- do not retry; surface to user
- Invariant violations -- these are bugs; crash and log for investigation
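For the missing-configuration case, a startup check might be sketched as follows. The setting names `DATABASE_URL` and `API_KEY` are placeholders, and `ConfigError` is a hypothetical class.

```python
from typing import Dict, Mapping, Sequence


class ConfigError(Exception):
    """Raised at startup when required configuration is missing."""


def load_config(
    env: Mapping[str, str],
    required: Sequence[str] = ("DATABASE_URL", "API_KEY"),
) -> Dict[str, str]:
    missing = [name for name in required if not env.get(name)]
    if missing:
        # No retry and no fallback: a missing setting cannot fix itself.
        raise ConfigError(f"Missing required configuration: {', '.join(missing)}")
    return {name: env[name] for name in required}
```

Calling `load_config(os.environ)` as the first step of startup lets the process exit with a clear message instead of failing later on the first request.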
5. Retry Only Transient Failures
Apply retry logic only to errors that are genuinely transient:
Retry: network timeout, 429, 500, 502, 503, 504, connection reset
No retry: 400, 401, 403, 404, 409, 422, config errors, type errors
Always use exponential backoff with jitter. Set a maximum retry count and total timeout. Log each retry attempt with the attempt number.
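The rules above can be sketched with the standard library alone. This version assumes "full jitter" (a uniform wait between zero and the capped exponential delay); `TransientError` and `retry_with_backoff` are hypothetical names, and the `sleep` and `rng` parameters exist only so the behavior can be tested without waiting.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    total_timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    deadline = time.monotonic() + total_timeout
    attempt = 1
    while True:
        try:
            return operation()
        except TransientError:
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = cap * rng()  # full jitter: uniform in [0, cap)
            out_of_budget = time.monotonic() + delay > deadline
            if attempt >= max_attempts or out_of_budget:
                raise  # retries exhausted; surface the last error
            logger.warning("attempt %d failed; retrying in %.2fs", attempt, delay)
            sleep(delay)
            attempt += 1
```

Any exception other than `TransientError` propagates immediately, which keeps permanent failures such as 400 or 404 out of the retry loop.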
6. Structured Errors for APIs
All API error responses must be machine-readable and consistent. Use RFC 9457 Problem Details for HTTP APIs:
{
  "type": "https://api.example.com/errors/validation-failed",
  "title": "Validation Failed",
  "status": 422,
  "detail": "The 'email' field must be a valid email address.",
  "errors": [
    { "field": "email", "code": "INVALID_FORMAT", "message": "Must be a valid email" }
  ],
  "trace_id": "req_abc123"
}
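A response like the one above could be produced by a small helper. This is a framework-neutral sketch: `problem_details` and `to_http_response` are hypothetical names, and `errors` and `trace_id` are treated as extension members, which RFC 9457 allows alongside the standard `type`, `title`, `status`, and `detail`.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def problem_details(
    *,
    type_uri: str,
    title: str,
    status: int,
    detail: str,
    errors: Optional[List[Dict[str, str]]] = None,
    trace_id: Optional[str] = None,
) -> Dict[str, Any]:
    body: Dict[str, Any] = {
        "type": type_uri,
        "title": title,
        "status": status,
        "detail": detail,
    }
    # Extension members are only included when they carry information.
    if errors:
        body["errors"] = errors
    if trace_id:
        body["trace_id"] = trace_id
    return body


def to_http_response(problem: Dict[str, Any]) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for whatever web framework is in use."""
    headers = {"Content-Type": "application/problem+json"}
    return problem["status"], headers, json.dumps(problem)
```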
7. Errors Must Be Observable
Every error that reaches production must be captured, categorized, and trackable:
- Structured logging with error code, severity, context, and trace ID
- Error tracking integration (Sentry or equivalent) for aggregation and alerting
- Error budgets tied to SLOs for prioritizing reliability work
- Dashboards showing error rates by category, endpoint, and service
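For the structured logging item, a minimal sketch using only the standard `logging` module is shown here. `JsonFormatter` and `log_error` are hypothetical helpers, and the field names follow the list above; a real deployment would more likely use an established JSON logging library.

```python
import json
import logging
from typing import Any, Dict


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "code": getattr(record, "code", None),
            "severity": getattr(record, "severity", None),
            "trace_id": getattr(record, "trace_id", None),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def log_error(
    logger: logging.Logger,
    message: str,
    *,
    code: str,
    severity: str,
    trace_id: str,
    context: Dict[str, Any],
) -> None:
    logger.error(
        message,
        extra={"code": code, "severity": severity, "trace_id": trace_id, "context": context},
    )
```

Attaching `JsonFormatter()` to a handler makes every error line filterable by `code` and `trace_id` in a log aggregator.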
Quick Reference
Error Classification Decision Tree
Is the error expected? (validation, not found, auth)
  YES -> Return as Result/Either value or structured API error
  NO  -> Is it a known infrastructure failure? (timeout, 503)
    YES -> Is it transient?
      YES -> Retry with backoff, then degrade gracefully
      NO  -> Fail fast, return structured error, alert
    NO  -> This is a defect (bug)
      -> Log with full context, send to error tracker, alert
TypeScript Error Pattern Selection
Need explicit return-type errors? -> Result<T, E> (neverthrow)
Need composable error pipelines? -> Effect<A, E, R>
Need simple success/failure branching? -> Discriminated union
Need error boundaries in React? -> ErrorBoundary component
Need API error responses? -> RFC 9457 Problem Details
Need retry logic? -> Exponential backoff + circuit breaker
Python Error Pattern Selection
Need domain error hierarchy? -> Custom exception classes
Need retry on transient failures? -> tenacity decorator
Need structured API errors? -> Pydantic model + RFC 9457
Need multi-error reporting? -> ExceptionGroup (3.11+)
Need context cleanup? -> contextlib (suppress, ExitStack)
Need result-type pattern? -> returns library or custom
Retry Configuration Defaults
| Parameter | Default | Notes |
|---|---|---|
| Max attempts | 3-5 | Depends on operation criticality |
| Base delay | 1 second | Starting backoff interval |
| Max delay | 30-60 seconds | Upper bound cap |
| Backoff multiplier | 2 | Doubles each attempt |
| Jitter | 0-100% of delay | Prevents thundering herd |
| Total timeout | 2-5 minutes | Overall deadline across all attempts |
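As a worked example of how these defaults interact, the sketch below computes the upper bound on each wait before jitter is applied. `backoff_caps` is a hypothetical helper, and the 30-second cap is the low end of the range in the table.

```python
from typing import List


def backoff_caps(
    retries: int,
    base_delay: float = 1.0,
    multiplier: float = 2.0,
    max_delay: float = 30.0,
) -> List[float]:
    """Upper bound on the wait before each retry, prior to jitter."""
    return [min(max_delay, base_delay * multiplier ** i) for i in range(retries)]
```

With these values the first six caps are 1, 2, 4, 8, 16, and 30 seconds. Five attempts means four waits, so the worst case spends 1 + 2 + 4 + 8 = 15 seconds sleeping, well inside a 2-minute total timeout.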
Workflow
Error Handling Design Protocol
1. Inventory Error Sources
- List all external dependencies (APIs, databases, queues, file systems)
- Identify all user input validation points
- Map authorization and authentication failure modes
- Document infrastructure failure scenarios (network, DNS, disk)
- Identify business logic failure conditions
2. Classify and Categorize
- Assign each error source a severity (warning, error, fatal)
- Determine recoverability (transient, correctable, permanent, fatal)
- Define error codes for each category (use a registry)
- Document which errors are retryable and with what strategy
- Map errors to user-facing messages (never expose internals)
3. Implement Error Types
- Create a base error class with code, context, cause, and retryable flag
- Build domain-specific error subclasses for each category
- Implement Result/Either types for operations with expected failures
- Add discriminated unions for function return types where appropriate
- Ensure all errors are serializable for logging and API responses
4. Add Resilience Patterns
- Implement retry with exponential backoff and jitter for transient failures
- Add circuit breakers for external service calls
- Configure timeouts for all I/O operations
- Implement fallback strategies (cache, default, degraded mode)
- Add bulkhead isolation for critical vs non-critical paths
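The circuit breaker item can be sketched as a small state machine. This is a single-threaded illustration under assumed thresholds, not a production implementation: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, and the injectable `clock` exists so state transitions can be tested without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow a probe call through
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed half-open probe, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each external dependency in its own breaker instance also gives a simple form of isolation, since one failing service cannot trip the breaker for another.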
5. Wire Up Observability
- Configure structured logging with error context fields
- Integrate error tracking (Sentry or equivalent)
- Set up alerting rules by error severity and rate
- Define error budgets tied to SLOs
- Create dashboards for error rate monitoring
6. Test Error Paths
- Unit test every error branch and recovery path
- Test retry behavior with simulated transient failures
- Test circuit breaker state transitions (closed -> open -> half-open)
- Verify error responses match RFC 9457 schema
- Test graceful degradation under dependency failure
- Chaos test: inject random failures in staging
Agentic Workflow Considerations
When an AI agent is implementing or modifying error handling:
- Preserve existing error contracts -- Changing error types or codes is a breaking change for consumers. Always check for existing error handlers downstream before modifying.
- Never swallow errors silently -- Every catch block must log, rethrow, or return a typed error. Empty catch blocks are defects.
- Test the unhappy path -- Generate tests for error conditions, not just success paths. Cover retry exhaustion, circuit breaker trips, and fallback activation.
- Use structured errors for agent-to-agent communication -- When agents call other agents or tools, use typed Result values or structured error objects, never raw strings.
- Implement compensation for multi-step operations -- If step 3 of 5 fails, the agent should know how to undo steps 1 and 2 or leave the system in a consistent state.
- Surface errors clearly -- When an agent encounters an error it cannot recover from, it should report the error with full context to the human, including what was attempted, what failed, and what the state is now.
- Respect error budgets -- If an agent detects elevated error rates, it should slow down or pause rather than contributing to cascading failures.
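The compensation guideline can be illustrated with a short runner that undoes completed steps in reverse order when a later step fails. All names here (`Step`, `StepFailed`, `run_with_compensation`) are hypothetical, and the sketch assumes undo actions themselves do not fail.

```python
from typing import Callable, List, Tuple

# (name, action, undo) for one step of a multi-step operation.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


class StepFailed(Exception):
    """Reports what was attempted, what failed, and what was rolled back."""

    def __init__(self, step: str, completed: List[str], rolled_back: List[str]) -> None:
        super().__init__(f"step {step!r} failed; rolled back {rolled_back}")
        self.step = step
        self.completed = completed
        self.rolled_back = rolled_back


def run_with_compensation(steps: List[Step]) -> List[str]:
    done: List[Step] = []
    for step in steps:
        name, action, _undo = step
        try:
            action()
        except Exception as exc:
            rolled_back: List[str] = []
            # Undo completed steps in reverse order to restore consistency.
            for prev_name, _action, undo in reversed(done):
                undo()
                rolled_back.append(prev_name)
            raise StepFailed(name, [s[0] for s in done], rolled_back) from exc
        done.append(step)
    return [s[0] for s in done]
```

The raised `StepFailed` carries the state an agent needs when reporting to a human: the failing step, the steps that had completed, and the steps that were undone.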
Checklist
Error Types
- Base error class includes code, message, cause, context, and retryable flag
- Domain errors extend base with category-specific fields
- All error codes are registered in a central catalog
- Error messages are user-safe (no stack traces, internal IDs, or secrets)
- Errors carry correlation/trace IDs for distributed tracing
Result Types (if using)
- All functions with expected failures return Result<T, E> instead of throwing
- Error variants are exhaustively handled (no unchecked `.value` access)
- Async operations use ResultAsync or equivalent
- Result chains use map/flatMap, not try-catch wrappers
- Error types in Result are specific, not `Error` or `unknown`
Retry and Resilience
- Retry logic uses exponential backoff with jitter
- Maximum retry count and total timeout are configured
- Only transient errors trigger retries (not 400, 401, 403, 404, 422)
- Circuit breaker configured for external service calls
- Timeouts set on all I/O operations
- Fallback strategies defined for critical paths
- Retry attempts are logged with attempt number and delay
API Errors
- All error responses follow RFC 9457 Problem Details format
- Content-Type is `application/problem+json` for error responses
- Error codes are stable strings from a published registry
- Validation errors include field-level details (field, code, message)
- Retryable errors include `retry_after_seconds`
- Error responses never expose stack traces, SQL, or internal paths
Observability
- Errors logged with structured fields (code, severity, context, trace_id)
- Error tracking service captures unhandled exceptions with source maps
- Alert rules defined for error rate spikes and new error types
- Error budgets established and monitored against SLOs
- Dashboards show error rates by service, endpoint, and error code
When to Escalate
- Changing error codes or types on a public API -- This is a breaking change. Requires versioning plan and consumer notification.
- Error rates exceeding SLO budget -- Indicates a systemic issue needing architecture review, not just error handling fixes.
- Cascading failures across services -- Circuit breaker and bulkhead patterns may need infrastructure-level changes.
- Cryptographic or security-sensitive errors -- Errors in auth, encryption, or secret handling need security team review.
- Multi-service compensation logic -- Saga/compensation patterns spanning multiple services need distributed systems expertise.
- Error handling in life-safety or financial systems -- Regulatory requirements may dictate specific error handling behaviors.
- Persistent retry storms -- If retries are amplifying failures rather than resolving them, the retry strategy needs redesign.
- Agent error loops -- If an agent repeatedly encounters the same error and cannot self-correct, it should halt and escalate to a human with full context.
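For the agent error loop case, one possible guard counts repeated error codes and signals a halt once a limit is reached. This is a sketch: `ErrorLoopGuard`, `EscalationRequired`, and the default of three repeats are assumptions, not a prescribed mechanism.

```python
from collections import Counter


class EscalationRequired(Exception):
    """Signals that the agent should stop and hand off to a human."""


class ErrorLoopGuard:
    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def record(self, error_code: str, attempted: str) -> None:
        """Record one occurrence; raise once the same code repeats too often."""
        self._seen[error_code] += 1
        count = self._seen[error_code]
        if count >= self.max_repeats:
            raise EscalationRequired(
                f"{error_code} occurred {count} times while attempting: {attempted}"
            )
```

An agent would call `record` from its error handler and treat `EscalationRequired` as the cue to stop retrying and report what was attempted, what failed, and the current state.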