reliability-improvement-plan


Identify single points of failure, assess recovery capabilities, and produce a prioritized remediation plan by analyzing IaC, scaling configs, and resilience patterns in the codebase.

By aws-samples. Updated 6/15/2026

name: reliability-improvement-plan
description: Identify single points of failure, assess recovery capabilities, and produce a prioritized remediation plan by analyzing IaC, scaling configs, and resilience patterns in the codebase.
not_for: security assessment, cost optimization, performance tuning, migration planning, full cross-pillar WA review
version: 2.0.0

Reliability Improvement Plan

Step 1: Gather context

Ask the user:

What workload would you like me to assess for reliability? Please share:

  • Workload name and code packages/directories to analyze
  • Availability target (99.9%, 99.95%, 99.99%, etc.)
  • Recovery objectives (RTO and RPO if defined)
  • Past incidents (optional — recent outages or near-misses)

If context is already provided or you are in a codebase with IaC, proceed directly.

Step 2: Fault Tolerance Discovery

Analyze infrastructure for single points of failure.

You MUST examine:

  • Compute deployments (AZ distribution, instance count, ASG configs)
  • Database configurations (Multi-AZ, read replicas, cluster topology)
  • Cache configurations (cluster mode, replica counts, failover)
  • Load balancer configurations (cross-zone, health checks, target groups)
  • NAT Gateway placement (single vs per-AZ)
  • DNS configurations (Route 53 health checks, failover routing)
  • Queue and messaging configs (DLQ, redrive policies)
  • Storage redundancy (S3 replication, EBS snapshots, EFS)

For each component, document:

  • File path and line numbers
  • Current redundancy level (single-AZ, multi-AZ, multi-region)
  • Failure blast radius
  • Failover mechanism (automatic, manual, none)

You MUST flag as HIGH RISK:

  • Single-AZ database deployments for production workloads
  • Compute without auto-scaling (fixed instance count)
  • No health checks on load-balanced targets
  • Single NAT Gateway serving multiple AZs
  • Stateful services without replication
  • Missing DLQ on async invocations (Lambda, SQS, EventBridge)
  • No circuit breaker or timeout on external service calls
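
Two of the flags above can be checked mechanically. The following is a minimal sketch, assuming the IaC is a CloudFormation template already parsed into a Python dict with literal property values (no `Ref` or `Fn::If`). The function name `find_fault_tolerance_risks` and the finding strings are hypothetical; the resource types and the `MultiAZ` property are standard CloudFormation. Aurora instances, which get their redundancy from the cluster, would need separate handling.

```python
from typing import Any


def find_fault_tolerance_risks(template: dict[str, Any]) -> list[str]:
    """Flag single-AZ RDS instances and a lone NAT Gateway in a parsed template."""
    findings: list[str] = []
    nat_gateways: list[str] = []

    for logical_id, resource in template.get("Resources", {}).items():
        rtype = resource.get("Type")
        props = resource.get("Properties", {})
        if rtype == "AWS::RDS::DBInstance":
            # A missing MultiAZ is treated as single-AZ; accept bool or string "true".
            if str(props.get("MultiAZ", "false")).lower() != "true":
                findings.append(f"{logical_id}: single-AZ database")
        elif rtype == "AWS::EC2::NatGateway":
            nat_gateways.append(logical_id)

    # Exactly one NAT Gateway usually means all AZs egress through one AZ.
    if len(nat_gateways) == 1:
        findings.append(f"{nat_gateways[0]}: single NAT Gateway")
    return findings
```

A finding from a check like this still needs the file path and line numbers attached before it goes into the report.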

Step 3: Recovery Capability Discovery

Analyze backup and recovery configurations.

You MUST examine:

  • AWS Backup plans and rules
  • RDS automated backup settings (retention, PITR)
  • S3 versioning and replication rules
  • DynamoDB PITR and backup settings
  • EBS snapshot configurations
  • Cross-region replication rules
  • Disaster recovery configurations (pilot light, warm standby resources)

For each stateful resource, document:

  • Backup frequency and retention
  • Recovery point capability (RPO)
  • Recovery time estimate (RTO)
  • Whether recovery has been tested (look for DR runbooks, FIS experiments)

You MUST flag as HIGH RISK:

  • Stateful resources with no backup configuration
  • Backup retention < 7 days for production data
  • No cross-region backup for critical data
  • No evidence of recovery testing (no FIS experiments, no DR runbooks)
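
The retention threshold above lends itself to a simple check. This sketch again assumes a CloudFormation template parsed into a dict with literal values; `find_backup_risks` is a hypothetical name. `BackupRetentionPeriod` and `PointInTimeRecoverySpecification` are real CloudFormation properties, and the fallback of 1 day for an omitted retention period is an assumption that should be verified against the engine in use.

```python
from typing import Any

MIN_RETENTION_DAYS = 7  # threshold from the HIGH RISK list above


def find_backup_risks(template: dict[str, Any]) -> list[str]:
    """Flag RDS instances with short or disabled backups and DynamoDB tables without PITR."""
    findings: list[str] = []
    for logical_id, resource in template.get("Resources", {}).items():
        rtype = resource.get("Type")
        props = resource.get("Properties", {})
        if rtype == "AWS::RDS::DBInstance":
            # Assumed default of 1 day when the property is omitted.
            retention = int(props.get("BackupRetentionPeriod", 1))
            if retention == 0:
                findings.append(f"{logical_id}: automated backups disabled")
            elif retention < MIN_RETENTION_DAYS:
                findings.append(f"{logical_id}: backup retention {retention} days")
        elif rtype == "AWS::DynamoDB::Table":
            pitr = props.get("PointInTimeRecoverySpecification", {})
            if str(pitr.get("PointInTimeRecoveryEnabled", "false")).lower() != "true":
                findings.append(f"{logical_id}: PITR not enabled")
    return findings
```

Resources covered by an AWS Backup plan may be protected even when these properties are unset, so a hit here is a prompt to look further, not a final verdict.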

Step 4: Scaling and Capacity Discovery

Analyze scaling and capacity configurations.

You MUST examine:

  • Auto Scaling Group configurations (min, max, desired, scaling policies)
  • ECS service scaling (target tracking, step scaling)
  • Lambda concurrency settings (reserved, provisioned)
  • DynamoDB capacity mode (on-demand vs provisioned, auto-scaling)
  • SQS/Kinesis throughput configurations
  • API Gateway throttling settings
  • Service quota usage and alarms

You MUST flag as HIGH RISK:

  • Compute without auto-scaling policies
  • ASG with min = max (no scaling headroom)
  • No service quota alarms
  • Lambda without reserved concurrency on critical functions
  • No load shedding or throttling for overload scenarios
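
The first two flags above can be sketched as a template check. It assumes a parsed CloudFormation template in which scaling policies reference their group with `Ref` or a literal name; `find_scaling_risks` is a hypothetical helper, while `MinSize`, `MaxSize`, and `AutoScalingGroupName` are real CloudFormation properties. Target tracking defined elsewhere (for example through Application Auto Scaling for ECS) is out of scope.

```python
from typing import Any


def find_scaling_risks(template: dict[str, Any]) -> list[str]:
    """Flag Auto Scaling groups with min = max or with no scaling policy attached."""
    resources = template.get("Resources", {})

    # Collect the groups that at least one scaling policy targets.
    scaled_groups = set()
    for resource in resources.values():
        if resource.get("Type") == "AWS::AutoScaling::ScalingPolicy":
            target = resource.get("Properties", {}).get("AutoScalingGroupName")
            if isinstance(target, dict):
                target = target.get("Ref")
            scaled_groups.add(target)

    findings: list[str] = []
    for logical_id, resource in resources.items():
        if resource.get("Type") != "AWS::AutoScaling::AutoScalingGroup":
            continue
        props = resource.get("Properties", {})
        # MinSize and MaxSize are strings in CloudFormation, so compare as integers.
        if int(props["MinSize"]) == int(props["MaxSize"]):
            findings.append(f"{logical_id}: min = max, no scaling headroom")
        if logical_id not in scaled_groups:
            findings.append(f"{logical_id}: no scaling policy attached")
    return findings
```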

Step 5: Resilience Pattern Discovery

Analyze application code for resilience patterns.

You MUST examine:

  • Retry configurations (SDK clients, custom retry logic)
  • Timeout settings (HTTP clients, database connections, Lambda timeout)
  • Circuit breaker implementations
  • Fallback logic and graceful degradation patterns
  • Idempotency handling (idempotency keys, deduplication)
  • Health check endpoint implementations
  • Connection pooling configurations

For each external integration, document:

  • Timeout configured (or missing)
  • Retry policy (exponential backoff, max attempts, jitter)
  • Circuit breaker (present or absent)
  • Fallback behavior on failure
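
For reference when reading retry code, this is roughly what a retry policy with exponential backoff, a cap, and full jitter looks like. It is an illustrative sketch, not any particular SDK's implementation; `backoff_delays` and its default values are hypothetical.

```python
import random
from typing import Optional


def backoff_delays(
    max_attempts: int,
    base: float = 0.1,
    cap: float = 5.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return the sleep (seconds) before each retry: capped exponential backoff with full jitter."""
    rng = rng or random.Random()
    delays: list[float] = []
    for attempt in range(max_attempts - 1):  # no sleep after the final attempt
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick uniformly between 0 and the current ceiling.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

Custom retry loops that sleep a fixed interval, or that back off without jitter or a maximum attempt count, are worth noting in the per-integration documentation.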

You MUST flag as HIGH RISK:

  • External service calls without timeouts
  • No retry logic on SDK clients
  • Missing idempotency on event-driven processing
  • Health checks that don't verify actual functionality (shallow checks)
  • Lambda timeout ≥ API Gateway integration timeout (the caller can receive a gateway timeout while the function is still running)
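
The Lambda versus API Gateway timeout comparison can be sketched as follows. It assumes a parsed CloudFormation template and a caller-supplied set of logical IDs for functions that sit behind API Gateway. `find_timeout_mismatches` is hypothetical, and the 29-second value is an assumed integration timeout that should be replaced with the configured one.

```python
from typing import Any

DEFAULT_API_TIMEOUT_SECONDS = 29  # assumed API Gateway integration timeout


def find_timeout_mismatches(
    template: dict[str, Any],
    api_backed: set[str],
    api_timeout: int = DEFAULT_API_TIMEOUT_SECONDS,
) -> list[str]:
    """Flag API-backed Lambda functions whose timeout meets or exceeds the API timeout."""
    findings: list[str] = []
    resources = template.get("Resources", {})
    for logical_id in sorted(api_backed):
        resource = resources.get(logical_id, {})
        if resource.get("Type") != "AWS::Lambda::Function":
            continue
        # Lambda's Timeout property is in seconds and defaults to 3 when omitted.
        timeout = int(resource.get("Properties", {}).get("Timeout", 3))
        if timeout >= api_timeout:
            findings.append(
                f"{logical_id}: Lambda timeout {timeout}s >= API timeout {api_timeout}s"
            )
    return findings
```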

Step 6: Change Management Discovery

Analyze deployment safety configurations.

You MUST examine:

  • Deployment strategies (canary, blue/green, rolling, all-at-once)
  • Health check gating on deployments
  • Automated rollback configurations (alarm-based)
  • Database migration strategies (backward-compatible, blue/green schema)
  • Feature flag usage

You MUST flag as HIGH RISK:

  • All-at-once deployment to production
  • No automated rollback on health check failure
  • Database migrations that aren't backward-compatible
  • No pre-production environment that mirrors production topology
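
Where deployments run through CodeDeploy, the first two flags can be checked from the template. This sketch assumes a parsed CloudFormation template with literal values; `find_deployment_risks` is hypothetical. `DeploymentConfigName` and `AutoRollbackConfiguration` are real properties of `AWS::CodeDeploy::DeploymentGroup`, and the predefined all-at-once configurations all end in `AllAtOnce`.

```python
from typing import Any


def find_deployment_risks(template: dict[str, Any]) -> list[str]:
    """Flag CodeDeploy deployment groups that deploy all at once or lack auto rollback."""
    findings: list[str] = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::CodeDeploy::DeploymentGroup":
            continue
        props = resource.get("Properties", {})

        # Matches CodeDeployDefault.AllAtOnce, .LambdaAllAtOnce, and .ECSAllAtOnce.
        config = props.get("DeploymentConfigName", "")
        if isinstance(config, str) and config.endswith("AllAtOnce"):
            findings.append(f"{logical_id}: all-at-once deployment ({config})")

        rollback = props.get("AutoRollbackConfiguration", {})
        if str(rollback.get("Enabled", "false")).lower() != "true":
            findings.append(f"{logical_id}: automated rollback not enabled")
    return findings
```

Pipelines built on other tools (for example ECS rolling updates or a CI system's own deploy steps) need their own equivalent of this check.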

---STOP--- Checkpoint: Discovery complete — present findings before evaluation.

Here is what I discovered about your workload's reliability:

  • Architecture: {summary of components and dependencies}
  • Single points of failure: {count identified so far}
  • Recovery capabilities: {summary of backup/DR status}

Shall I proceed with the full reliability evaluation, or would you like to adjust scope?

Do NOT proceed past this point until the user explicitly confirms.

Step 7: Evaluate against WA Framework questions

For each question, provide: Status, Evidence (file:line), Gaps, Risk.

REL 1 — How do you manage service quotas and constraints?

  • Evidence: quota alarms, SDK retry configs, throttling handling code

REL 2 — How do you plan your network topology?

  • Evidence: subnet definitions, AZ distribution, NAT redundancy

REL 3 — How do you design your workload service architecture?

  • Evidence: service segmentation (monolith vs. microservices), domain-focused service boundaries, API contracts and versioning

REL 4 — How do you design interactions in a distributed system to prevent failures?

  • Evidence: retry logic, timeout configs, SQS decoupling, idempotency tokens

REL 5 — How do you design interactions in a distributed system to mitigate or withstand failures?

  • Evidence: circuit breaker code, fallback paths, bulkhead patterns, load shedding

REL 6 — How do you monitor workload resources?

  • Evidence: health check endpoints, alarm definitions, composite alarms, dashboard configs

REL 7 — How do you design your workload to adapt to changes in demand?

  • Evidence: ASG configs, scaling policies and their metrics, scheduled scaling, predictive scaling configs, Lambda concurrency, DynamoDB capacity mode

REL 8 — How do you implement change?

  • Evidence: deployment configs, health check gating, rollback trigger alarms

REL 9 — How do you back up data?

  • Evidence: AWS Backup plans, PITR settings, replication rules, snapshot configs

REL 10 — How do you use fault isolation to protect your workload?

  • Evidence: AZ distribution, cell-based patterns, shuffle sharding, isolation boundaries

REL 11 — How do you design your workload to withstand component failures?

  • Evidence: multi-AZ configs, failover policies, stateless design, health-based routing

REL 12 — How do you test reliability?

  • Evidence: FIS experiments, failure injection code, game day runbooks, DR test scripts

REL 13 — How do you plan for disaster recovery (DR)?

  • Evidence: cross-region resources, DR automation, backup restore procedures, RTO/RPO docs
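
One way to keep the four per-question fields consistent is a small record type. This is only a sketch: the class name, field names, status values, and the example path in the usage note are hypothetical, and "Cannot Determine" is used as the default risk for aspects not visible in code.

```python
from dataclasses import dataclass, field


@dataclass
class QuestionEvaluation:
    """One Step 7 evaluation: Status, Evidence (file:line), Gaps, Risk."""

    question: str  # e.g. "REL 9"
    status: str  # hypothetical values such as "Met", "Partial", "Not Met"
    evidence: list[str] = field(default_factory=list)  # "file:line" references
    gaps: list[str] = field(default_factory=list)
    risk: str = "Cannot Determine"

    def summary(self) -> str:
        """Render a one-line summary, making missing evidence explicit."""
        refs = ", ".join(self.evidence) or "no code evidence"
        return f"{self.question}: {self.status} ({self.risk}); evidence: {refs}"
```

For example, `QuestionEvaluation("REL 9", "Partial", ["infra/db.tf:42"], ["no cross-region copy"], "High").summary()` yields a line that can be pasted into the checkpoint summary.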

Step 8: Risk Assessment

For each finding, assess using Impact × Likelihood:

Impact: Minor (brief degradation, automatic recovery) | Moderate (extended outage for subset of users, manual intervention needed) | Severe (full outage, data loss, cannot recover within RTO)

Likelihood: Low (requires multiple simultaneous failures) | Medium (single component failure could trigger) | High (normal operational event could trigger, no redundancy)

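| Impact | Likelihood | Risk Level |
|--------|------------|------------|
| Severe | High | Critical |
| Severe | Medium | High |
| Severe | Low | High |
| Moderate | High | High |
| Moderate | Medium | Medium |
| Moderate | Low | Medium |
| Minor | High | Medium |
| Minor | Medium | Low |
| Minor | Low | Low |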
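
The Impact × Likelihood matrix above is a direct lookup. A minimal sketch, with `RISK_MATRIX` and `risk_level` as hypothetical names:

```python
RISK_MATRIX = {
    ("Severe", "High"): "Critical",
    ("Severe", "Medium"): "High",
    ("Severe", "Low"): "High",
    ("Moderate", "High"): "High",
    ("Moderate", "Medium"): "Medium",
    ("Moderate", "Low"): "Medium",
    ("Minor", "High"): "Medium",
    ("Minor", "Medium"): "Low",
    ("Minor", "Low"): "Low",
}


def risk_level(impact: str, likelihood: str) -> str:
    """Map an (impact, likelihood) pair to its risk level; raise on unknown labels."""
    try:
        return RISK_MATRIX[(impact, likelihood)]
    except KeyError:
        raise ValueError(f"unknown impact/likelihood: {impact!r}, {likelihood!r}") from None
```

Raising on an unknown label, instead of defaulting to "Low", keeps a typo from silently downgrading a finding.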

---STOP--- Checkpoint: Assessment complete — confirm findings before generating remediation plan.

Assessment summary:

  • Critical findings: {count}
  • High findings: {count}
  • Medium/Low findings: {count}

Shall I produce the full remediation plan, or would you like to discuss specific findings first?

Do NOT proceed past this point until the user explicitly confirms.

Step 9: Produce the plan

# Reliability Improvement Plan: {Workload Name}

## Executive Summary
- **Date**: {date}
- **Availability Target**: {target}
- **Packages Analyzed**: {list}
- **Findings**: {X} Critical, {Y} High, {Z} Medium, {W} Low
- **Overall Reliability Maturity**: {1-5} — {one-line justification}

## Reliability Scorecard
| Domain | Score (1-5) | Key Strength | Key Gap |
|--------|-------------|--------------|---------|
| Fault Tolerance | {score} | {strength} | {gap} |
| Recovery & Backup | {score} | {strength} | {gap} |
| Scaling & Capacity | {score} | {strength} | {gap} |
| Resilience Patterns | {score} | {strength} | {gap} |
| Change Management | {score} | {strength} | {gap} |
| Testing & Validation | {score} | {strength} | {gap} |

## Single Points of Failure
| Component | Evidence | Failure Impact | Current Mitigation | Risk Level |
|-----------|----------|---------------|-------------------|------------|
| {component} | {file:line} | {impact} | {mitigation or "None"} | {Critical/High/Medium/Low} |

## Critical and High Risk Findings
{For each: ID, domain, title, description, evidence (file:line), impact assessment, recommendation, effort, AWS services}

## Medium and Low Risk Findings
{Condensed format}

## Prioritized Remediation Plan

### Quick Wins (< 1 week)
| Finding | Action | Impact | Effort |
|---------|--------|--------|--------|
{Enable Multi-AZ, add health checks, configure DLQs, add timeouts}

### Foundation (1-4 weeks)
| Finding | Action | Impact | Effort | Dependencies |
|---------|--------|--------|--------|--------------|
{Auto-scaling, circuit breakers, backup configs, deployment safety}

### Strategic (1-3 months)
| Finding | Action | Impact | Effort | Dependencies |
|---------|--------|--------|--------|--------------|
{Multi-region DR, chaos engineering, cell-based architecture}

## Testing Plan
| Test | Validates | Frequency | AWS Service | Evidence Exists |
|------|-----------|-----------|-------------|-----------------|
| AZ failover | Compute survives AZ loss | Monthly | FIS | {Yes/No} |
| Database failover | RDS failover < 60s | Quarterly | FIS | {Yes/No} |
| Load test | Handles 2x peak | Before releases | Load Testing | {Yes/No} |
| Backup restore | RPO met, data recoverable | Monthly | AWS Backup | {Yes/No} |
| Deployment rollback | Bad deploy reverted < 5 min | Every deploy | CodeDeploy | {Yes/No} |

## Next Steps
{Top 5 concrete reliability actions the team should take this week}

Step 10: Offer follow-up

After delivering the plan, offer:

Would you like me to:

  • Design multi-AZ architecture for a specific component?
  • Generate FIS experiment templates for chaos engineering?
  • Implement circuit breaker patterns for service dependencies?
  • Create backup and DR IaC for stateful resources?
  • Design a deployment safety configuration with automated rollback?

Calibration Guidance

  • A workload with Multi-AZ, auto-scaling, health checks, automated rollback, and backups is MATURE — focus on advanced testing (chaos engineering, DR drills, game days)
  • Every finding MUST have code evidence — don't flag "missing Multi-AZ" without checking the IaC
  • For data pipelines: prioritize data durability over compute availability — message loss is worse than processing delay
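  • Match expectations to availability target: 99.9% generally doesn't require multi-region; at 99.99% and above, evaluate whether multi-region is needed to meet the target and the stated RTO/RPO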
  • "Cannot Determine" is valid for operational aspects not visible in code (e.g., whether DR drills are actually run)
  • Acknowledge existing reliability patterns prominently before listing gaps
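
When matching expectations to an availability target, it can help to convert the percentage into a downtime budget. The arithmetic is standard; the function name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Return the downtime budget in minutes for a target over the given number of days."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * 24 * 60
```

Over a year, 99.9% allows roughly 525.6 minutes (about 8.8 hours) of downtime, while 99.99% allows roughly 52.6 minutes, which is why manual recovery steps that are acceptable at the lower target often are not at the higher one.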
Install via CLI
npx skills add https://github.com/aws-samples/sample-well-architected-skills-and-steering --skill reliability-improvement-plan
Repository Details
Stars: 180
Forks: 30
Branch: main
Path: SKILL.md