name: backup-recovery-policy
description: "Business continuity and disaster recovery: 30-day retention, quarterly restore tests, RTO/RPO targets per ISO 27001 A.17"
license: Apache-2.0
Backup and Recovery Policy Skill
Purpose
This skill provides systematic guidance for implementing business continuity and disaster recovery within the CIA platform, ensuring data protection aligns with business impact analysis, RTO/RPO targets, and ISO 27001 Annex A.17 requirements.
When to Use This Skill
Apply this skill when:
- ✅ Designing backup strategies for new systems or data stores
- ✅ Configuring AWS backup services (RDS snapshots, S3 versioning, EBS backups)
- ✅ Implementing disaster recovery procedures
- ✅ Conducting quarterly backup restore tests
- ✅ Defining RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets
- ✅ Planning for data retention and archival
- ✅ Responding to data loss incidents
- ✅ Conducting business continuity planning
Do NOT skip for:
- ❌ Development/test environments (may contain production data copies)
- ❌ Temporary data stores (may become permanent)
- ❌ "Easily reproducible" data (validate recovery procedures)
- ❌ Third-party managed services (verify backup capabilities)
Business Impact-Driven Backup Framework
RTO/RPO Targets by Classification
| Classification | Business Impact | RPO Target | RTO Target | Backup Frequency | Retention |
|---|---|---|---|---|---|
| RESTRICTED | Extreme | <15 minutes | <1 hour | Continuous replication | 30 days minimum |
| CONFIDENTIAL | Very High | <4 hours | <4 hours | Hourly | 7 years (financial data) |
| INTERNAL | Moderate | <24 hours | <24 hours | Daily | 3 years |
| PUBLIC | Low | >24 hours | >72 hours | Weekly | Indefinite |
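The targets in the table can be encoded as data so that a measured recovery is checked mechanically rather than by eye. The following is a minimal sketch, not part of the CIA platform: the names `RecoveryTargets`, `TARGETS`, and `meets_targets` are hypothetical, and PUBLIC is treated as having no upper bound because the table gives only ">24 hours" and ">72 hours" for it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated downtime


# Upper bounds copied from the classification table.
# PUBLIC maps to None: the table defines no upper bound for it (assumption).
TARGETS = {
    "RESTRICTED": RecoveryTargets(rpo=timedelta(minutes=15), rto=timedelta(hours=1)),
    "CONFIDENTIAL": RecoveryTargets(rpo=timedelta(hours=4), rto=timedelta(hours=4)),
    "INTERNAL": RecoveryTargets(rpo=timedelta(hours=24), rto=timedelta(hours=24)),
    "PUBLIC": None,
}


def meets_targets(classification: str, data_loss: timedelta, downtime: timedelta) -> bool:
    """Return True when measured data loss and downtime are both under target."""
    targets = TARGETS[classification]  # KeyError for an unknown classification
    if targets is None:
        return True  # no upper bound to violate
    return data_loss < targets.rpo and downtime < targets.rto
```

For example, `meets_targets("RESTRICTED", timedelta(minutes=20), timedelta(minutes=45))` fails on RPO even though the RTO was met.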
Backup Strategy Decision Tree
graph TD
START["💾 Backup Need"] --> CLASSIFY{"🏷️ Data Classification?"}
CLASSIFY -->|RESTRICTED| CRITICAL["🔴 Critical Backup<br/>Continuous Replication"]
CLASSIFY -->|CONFIDENTIAL| HIGH["🟠 High Priority<br/>Hourly Backups"]
CLASSIFY -->|INTERNAL| MEDIUM["🟡 Medium Priority<br/>Daily Backups"]
CLASSIFY -->|PUBLIC| STANDARD["🟢 Standard<br/>Weekly Backups"]
CRITICAL --> IMPACT{"📊 Business Impact?"}
HIGH --> IMPACT
MEDIUM --> IMPACT
STANDARD --> IMPACT
IMPACT -->|Financial System| FINANCE["💰 Financial Data<br/>7-year retention"]
IMPACT -->|Core Operations| CORE["🏗️ Core Systems<br/>1-year retention"]
IMPACT -->|Support Functions| SUPPORT["🛠️ Support Systems<br/>3-month retention"]
FINANCE --> METHOD{"🔧 Backup Method?"}
CORE --> METHOD
SUPPORT --> METHOD
METHOD -->|Database| RDS["📊 AWS RDS<br/>Automated Snapshots"]
METHOD -->|Files| S3["📁 AWS S3<br/>Versioning + Lifecycle"]
METHOD -->|Infrastructure| IaC["🏗️ Infrastructure as Code<br/>Git + CloudFormation"]
RDS --> TEST["🧪 Quarterly Restore Test"]
S3 --> TEST
IaC --> TEST
TEST --> MONITOR["📈 Monitoring & Alerts"]
style CRITICAL fill:#D32F2F
style HIGH fill:#FF9800
style MEDIUM fill:#FDD835
style STANDARD fill:#4CAF50
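The decision tree above has three independent branch points (classification, business impact, backup method), so it can be sketched as three lookups. This is an illustrative sketch only: the function name `backup_plan` and the short keys (`"financial"`, `"core"`, `"support"`, `"database"`, `"files"`, `"infrastructure"`) are hypothetical labels for the diagram's branches.

```python
# Each mapping mirrors one decision node of the diagram.
FREQUENCY_BY_CLASSIFICATION = {
    "RESTRICTED": "continuous replication",
    "CONFIDENTIAL": "hourly",
    "INTERNAL": "daily",
    "PUBLIC": "weekly",
}

RETENTION_BY_FUNCTION = {
    "financial": "7 years",
    "core": "1 year",
    "support": "3 months",
}

METHOD_BY_STORE = {
    "database": "AWS RDS automated snapshots",
    "files": "AWS S3 versioning + lifecycle",
    "infrastructure": "Git + CloudFormation",
}


def backup_plan(classification: str, business_function: str, store_type: str) -> dict:
    """Walk the decision tree; a KeyError signals a branch the diagram does not define."""
    return {
        "frequency": FREQUENCY_BY_CLASSIFICATION[classification],
        "retention": RETENTION_BY_FUNCTION[business_function],
        "method": METHOD_BY_STORE[store_type],
        # Every path in the diagram ends in the same two steps.
        "follow_up": ["quarterly restore test", "monitoring and alerts"],
    }
```

Raising on unknown input is deliberate: a data store that does not fit any branch should be classified explicitly rather than silently given a default.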
Backup Strategies
Full Backup Strategy
Definition: Complete copy of all data at a point in time.
Use Cases:
- Initial backup establishment
- Monthly comprehensive backups
- Before major system changes
- Regulatory compliance requirements
AWS Implementation:
# CloudFormation template for full database backup
Resources:
  DatabaseFullBackupFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: cia-database-full-backup
      Runtime: python3.12
      Handler: index.lambda_handler
      Role: !GetAtt BackupFunctionRole.Arn
      Timeout: 900 # 15 minutes
      Environment:
        Variables:
          RDS_INSTANCE_ID: !Ref CIADatabase
          BACKUP_BUCKET: !Ref BackupBucket
      Code:
        ZipFile: |
          import boto3
          import datetime
          import os

          rds = boto3.client('rds')

          def lambda_handler(event, context):
              """
              Create full RDS snapshot
              """
              instance_id = os.environ['RDS_INSTANCE_ID']
              # Explicit UTC timestamp so snapshot names never depend on a local timezone
              timestamp = datetime.datetime.now(datetime.timezone.utc).strftime('%Y%m%d-%H%M%S')
              snapshot_id = f"{instance_id}-full-{timestamp}"
              # Create snapshot (asynchronous: the call returns before the snapshot completes)
              response = rds.create_db_snapshot(
                  DBSnapshotIdentifier=snapshot_id,
                  DBInstanceIdentifier=instance_id,
                  Tags=[
                      {'Key': 'BackupType', 'Value': 'Full'},
                      {'Key': 'CreatedBy', 'Value': 'Automated'},
                      {'Key': 'Retention', 'Value': '30days'}
                  ]
              )
              print(f"Full backup initiated: {snapshot_id}")
              return {
                  'statusCode': 200,
                  'body': snapshot_id
              }

  # Schedule monthly full backups
  FullBackupSchedule:
    Type: AWS::Events::Rule
    Properties:
      Name: cia-monthly-full-backup
      Description: Monthly full database backup
      ScheduleExpression: cron(0 2 1 * ? *) # 1st of month at 2 AM UTC
      State: ENABLED
      Targets:
        - Arn: !GetAtt DatabaseFullBackupFunction.Arn
          Id: FullBackupTarget

  # Without this permission the rule fires but cannot invoke the function
  FullBackupInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref DatabaseFullBackupFunction
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt FullBackupSchedule.Arn
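The snapshots created by this function are tagged with a 30-day retention, but manual RDS snapshots are not deleted automatically, so something has to select the expired ones. A minimal sketch of that selection step follows; it assumes the input has the shape of the `DBSnapshots` list returned by boto3's `describe_db_snapshots`, and the function name `expired_full_snapshots` is hypothetical. The actual deletion call is intentionally left out.

```python
from datetime import datetime, timedelta, timezone


def expired_full_snapshots(snapshots, now, retention_days=30):
    """Return identifiers of completed full snapshots older than the retention window.

    `snapshots` is assumed to look like the `DBSnapshots` list from
    boto3's `describe_db_snapshots` (dicts with `DBSnapshotIdentifier`,
    `Status`, and a timezone-aware `SnapshotCreateTime`).
    """
    cutoff = now - timedelta(days=retention_days)
    return [
        snap["DBSnapshotIdentifier"]
        for snap in snapshots
        if "-full-" in snap["DBSnapshotIdentifier"]  # naming scheme of the backup function
        and snap.get("Status") == "available"        # skip snapshots still being created
        and snap["SnapshotCreateTime"] < cutoff
    ]
```

Keeping the selection separate from the deletion makes it possible to log or review the candidate list before anything is removed.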
RTO and RPO Implementation
Recovery Time Objective (RTO)
Definition: Maximum acceptable time to restore service after an outage.
RTO Tiers:
| RTO Level | Time Window | Business Function Example | Implementation |
|---|---|---|---|
| Instant | <5 minutes | Financial transactions | Multi-AZ failover |
| Critical | 5-60 minutes | Core database | Automated failover |
| High | 1-4 hours | Application services | Blue-green deployment |
| Medium | 4-24 hours | Analytics systems | Manual restore from backup |
| Standard | >24 hours | Historical archives | Restore on demand |
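A measured recovery time can be mapped back onto these tiers to see which tier a system actually achieves. The sketch below is illustrative; `rto_tier` is a hypothetical name, and because the table's ranges share boundary values (60 minutes is both the end of Critical and the start of High), it assumes a boundary value belongs to the faster tier.

```python
from datetime import timedelta

# Inclusive upper bounds for the tiers after "Instant" (boundary handling is an assumption).
_UPPER_BOUNDS = [
    (timedelta(hours=1), "Critical"),
    (timedelta(hours=4), "High"),
    (timedelta(hours=24), "Medium"),
]


def rto_tier(recovery_time: timedelta) -> str:
    """Return the RTO tier name that a measured recovery time falls into."""
    if recovery_time < timedelta(minutes=5):
        return "Instant"
    for upper, name in _UPPER_BOUNDS:
        if recovery_time <= upper:
            return name
    return "Standard"
```

Comparing the achieved tier with the tier assigned to the system is a quick way to spot targets that exist only on paper.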
RTO Configuration Example:
# Multi-AZ RDS for instant failover (RTO <5 minutes)
CIADatabase:
  Type: AWS::RDS::DBInstance
  Properties:
    DBInstanceIdentifier: cia-production-db
    Engine: postgres
    EngineVersion: "18.3"
    DBInstanceClass: db.t3.medium
    AllocatedStorage: 100
    StorageType: gp3
    StorageEncrypted: true
    KmsKeyId: !Ref DatabaseEncryptionKey
    # Multi-AZ for high availability (automatic failover)
    MultiAZ: true
    # Automated backups for point-in-time recovery
    BackupRetentionPeriod: 35
    PreferredBackupWindow: "03:00-04:00"
    # Deletion protection
    DeletionProtection: true
    Tags:
      - Key: RTO
        Value: Critical-5to60min
      - Key: RPO
        Value: NearRealtime-1to15min
      - Key: BusinessImpact
        Value: VeryHigh
Recovery Point Objective (RPO)
Definition: Maximum acceptable data loss measured in time.
RPO Tiers:
| RPO Level | Data Loss Window | Business Function Example | Backup Frequency |
|---|---|---|---|
| Zero Loss | <1 minute | Financial records | Synchronous replication |
| Near Real-time | 1-15 minutes | Core database | Continuous backup |
| Minimal | 15-60 minutes | Application data | 15-minute snapshots |
| Hourly | 1-4 hours | User activity logs | Hourly backups |
| Daily | 4-24 hours | Analytics data | Daily backups |
| Extended | >24 hours | Archived data | Weekly backups |
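Two small calculations sit behind this table: the data loss actually incurred in an incident, and the worst case a given backup schedule can produce. The sketch below illustrates both under a simplifying assumption (stated in the code) that a periodic backup only protects data once it has finished; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_recoverable: datetime, failure_time: datetime) -> timedelta:
    """Achieved RPO: time between the newest recoverable point and the failure."""
    if failure_time < last_recoverable:
        raise ValueError("failure_time precedes the last recoverable point")
    return failure_time - last_recoverable


def worst_case_rpo(backup_interval: timedelta,
                   backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst-case loss for periodic backups.

    Assumption: a backup captures the state at its start but is only usable
    once complete, so a failure just before completion loses one full
    interval plus the backup's own running time.
    """
    return backup_interval + backup_duration
```

Under this model an hourly backup that takes ten minutes to complete can lose up to 70 minutes of data, which still fits the Hourly tier (1-4 hours) but not the Minimal tier.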
Quarterly Restore Testing
Restore Test Procedure
Objective: Verify backup integrity and validate RTO/RPO targets.
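Frequency: Quarterly at minimum. The quarterly interval is this policy's choice; ISO 27001 A.17.1.3 itself only requires verification at regular intervals.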
Test Checklist:
- Select representative backup (full + incrementals)
- Restore to isolated test environment
- Verify data integrity (checksums, row counts)
- Test application functionality against restored data
- Measure actual recovery time vs RTO target
- Measure data loss vs RPO target
- Document results and lessons learned
- Update recovery procedures if needed
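The row-count part of the integrity check in this list can be sketched as a comparison of two per-table counts. This is a minimal illustration, assuming the source counts were recorded at the time the backup was taken (otherwise normal writes would show up as false mismatches); `row_count_mismatches` and the table names in the usage example are hypothetical.

```python
def row_count_mismatches(source_counts: dict, restored_counts: dict) -> dict:
    """Return {table: (expected, actual)} for every table whose counts differ.

    A table missing on one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source_counts) | set(restored_counts)):
        expected = source_counts.get(table)
        actual = restored_counts.get(table)
        if expected != actual:
            mismatches[table] = (expected, actual)
    return mismatches
```

An empty result means the counts agree; it does not prove the row contents match, which is why the checklist also lists checksums.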
Automated Restore Test Script:
#!/bin/bash
# Quarterly backup restore test
# Tests RTO/RPO compliance and backup integrity
set -euo pipefail
TEST_DATE=$(date +%Y%m%d-%H%M%S)
TEST_REPORT="backup-restore-test-${TEST_DATE}.md"
TEST_INSTANCE="cia-restore-test-${TEST_DATE}"
log() {
echo "[$(date -u +"%Y-%m-%d %H:%M:%S UTC")] $*" | tee -a "${TEST_REPORT}"
}
# Start restore test
log "# Quarterly Backup Restore Test"
log ""
log "**Test Date**: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
log "**Tester**: CEO"
log "**Test Instance**: ${TEST_INSTANCE}"
log ""
# Step 1: Identify latest backup
log "## Step 1: Identify Latest Backup"
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
--db-instance-identifier cia-production-db \
--query 'DBSnapshots | sort_by(@, &SnapshotCreateTime) | [-1].DBSnapshotIdentifier' \
--output text)
log "- Latest snapshot: ${SNAPSHOT_ID}"
SNAPSHOT_TIME=$(aws rds describe-db-snapshots \
--db-snapshot-identifier "${SNAPSHOT_ID}" \
--query 'DBSnapshots[0].SnapshotCreateTime' \
--output text)
log "- Snapshot time: ${SNAPSHOT_TIME}"
log ""
# Step 2: Restore snapshot to test instance
log "## Step 2: Restore Database"
START_TIME=$(date +%s)
log "- Initiating restore..."
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier "${TEST_INSTANCE}" \
--db-snapshot-identifier "${SNAPSHOT_ID}" \
--db-instance-class db.t3.small \
--no-publicly-accessible \
--no-multi-az \
--tags Key=Purpose,Value=RestoreTest Key=TestDate,Value="${TEST_DATE}"
# Wait for instance to be available
log "- Waiting for instance to become available..."
aws rds wait db-instance-available --db-instance-identifier "${TEST_INSTANCE}"
END_TIME=$(date +%s)
RESTORE_DURATION=$((END_TIME - START_TIME))
log "- ✅ Restore completed in ${RESTORE_DURATION} seconds"
log ""
# Step 3: Verify data integrity
log "## Step 3: Data Integrity Verification"
# Get endpoint
DB_ENDPOINT=$(aws rds describe-db-instances \
--db-instance-identifier "${TEST_INSTANCE}" \
--query 'DBInstances[0].Endpoint.Address' \
--output text)
log "- Database endpoint: ${DB_ENDPOINT}"
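# Verify row counts
# pg_stat_user_tables names the table column "relname"; its counters are estimates
# and may be empty on a freshly restored instance, so ANALYZE runs first.
# Assumes credentials are supplied via PGPASSWORD or ~/.pgpass.
log "- Verifying table row counts..."
psql -h "${DB_ENDPOINT}" -U cia_user -d cia_database \
-c "ANALYZE;" \
-c "\
SELECT schemaname, relname AS tablename, n_live_tup AS row_count \
FROM pg_stat_user_tables \
ORDER BY n_live_tup DESC \
LIMIT 10;" | tee -a "${TEST_REPORT}"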
# Step 4: Cleanup
log "## Step 4: Cleanup"
log "- Deleting test instance..."
aws rds delete-db-instance \
--db-instance-identifier "${TEST_INSTANCE}" \
--skip-final-snapshot \
--delete-automated-backups
log "- ✅ Test instance cleanup initiated"
log ""
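# Summary
log "## Test Summary"
log ""
# Compare the measured restore time with the 4-hour (14400 s) RTO target
# instead of reporting a pass unconditionally
if [ "${RESTORE_DURATION}" -lt 14400 ]; then
  RTO_STATUS="✅ Pass"
else
  RTO_STATUS="❌ Fail"
fi
log "| Metric | Target | Actual | Status |"
log "|--------|--------|--------|--------|"
log "| RTO | <4 hours | $((RESTORE_DURATION / 60)) minutes | ${RTO_STATUS} |"
log "| Data Integrity | 100% | Row counts listed in Step 3 | ⚠️ Review manually |"
log ""
echo "✅ Restore test completed. Report: ${TEST_REPORT}"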
Retention Policies
Retention by Classification
| Classification | Retention Period | Rationale | Disposal Method |
|---|---|---|---|
| RESTRICTED | Minimum required | Compliance, immediate disposal after expiry | Secure deletion (multi-pass overwrite) |
| CONFIDENTIAL | 7 years | Financial/legal requirements (Swedish law) | Secure deletion with audit trail |
| INTERNAL | 3 years | Operational history | Standard deletion |
| PUBLIC | Indefinite | Historical value, public interest | Standard deletion (if needed) |
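For the two classifications with a fixed period, the disposal date follows directly from the creation date. The sketch below is an illustration under stated assumptions: `disposal_date` is a hypothetical helper, PUBLIC returns `None` because its retention is indefinite, and RESTRICTED is rejected because "minimum required" is not a fixed number of years.

```python
from datetime import date

# Fixed retention periods from the table; other classifications are handled explicitly.
RETENTION_YEARS = {"CONFIDENTIAL": 7, "INTERNAL": 3}


def disposal_date(classification: str, created: date):
    """Return the earliest date on which data may be disposed of, or None if never."""
    if classification == "PUBLIC":
        return None  # indefinite retention
    if classification not in RETENTION_YEARS:
        raise ValueError(f"no fixed retention period for {classification!r}")
    target_year = created.year + RETENTION_YEARS[classification]
    try:
        return created.replace(year=target_year)
    except ValueError:
        # 29 February in a non-leap target year: round up to 1 March so the
        # data is never disposed of before the full period has elapsed.
        return date(target_year, 3, 1)
```

Computing with calendar years rather than a fixed day count avoids falling a day or two short of seven years when leap days are involved.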
S3 Lifecycle Policy
BackupBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: cia-backups
    BucketEncryption:
      ServerSideEncryptionConfiguration:
        - ServerSideEncryptionByDefault:
            SSEAlgorithm: aws:kms
            KMSMasterKeyID: !Ref BackupEncryptionKey
    VersioningConfiguration:
      Status: Enabled
    LifecycleConfiguration:
      Rules:
        # CONFIDENTIAL financial data: 7-year retention
        - Id: ConfidentialFinancialRetention
          Status: Enabled
          Prefix: confidential/financial/
          ExpirationInDays: 2555 # 7 years
          NoncurrentVersionExpirationInDays: 90
          Transitions:
            - TransitionInDays: 90
              StorageClass: STANDARD_IA
            - TransitionInDays: 365
              StorageClass: GLACIER
        # INTERNAL data: 3-year retention
        - Id: InternalDataRetention
          Status: Enabled
          Prefix: internal/
          ExpirationInDays: 1095 # 3 years
          NoncurrentVersionExpirationInDays: 30
          Transitions:
            - TransitionInDays: 30
              StorageClass: STANDARD_IA
            - TransitionInDays: 180
              StorageClass: GLACIER
    PublicAccessBlockConfiguration:
      BlockPublicAcls: true
      BlockPublicPolicy: true
      IgnorePublicAcls: true
      RestrictPublicBuckets: true
    Tags:
      - Key: Purpose
        Value: BackupStorage
      - Key: DataClassification
        Value: Mixed
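To reason about what these lifecycle rules do to a given object, the two rules can be modelled as a small function of key prefix and object age. This is a simplified sketch, not how S3 evaluates rules internally: it ignores noncurrent versions and the fact that S3 applies lifecycle actions asynchronously, so real transitions can lag the configured day. The name `lifecycle_state` and the `"EXPIRED"` marker are hypothetical.

```python
# Prefix -> (transitions as (min_age_days, storage_class), expiration age in days),
# copied from the lifecycle configuration above.
LIFECYCLE_RULES = {
    "confidential/financial/": ([(90, "STANDARD_IA"), (365, "GLACIER")], 2555),
    "internal/": ([(30, "STANDARD_IA"), (180, "GLACIER")], 1095),
}


def lifecycle_state(key: str, age_days: int) -> str:
    """Return the storage class a current object version should be in, or "EXPIRED"."""
    for prefix, (transitions, expire_after) in LIFECYCLE_RULES.items():
        if key.startswith(prefix):
            if age_days >= expire_after:
                return "EXPIRED"
            state = "STANDARD"
            for min_age, storage_class in transitions:
                if age_days >= min_age:
                    state = storage_class
            return state
    # No rule matches the prefix: the object stays in STANDARD and never expires.
    return "STANDARD"
```

The last branch is worth noticing: objects written outside the two prefixes are never transitioned or expired by this configuration.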
Disaster Recovery Scenarios
Scenario 1: Database Corruption
Trigger: Application detects data integrity issues, corrupted records.
Recovery Procedure:
Immediate Actions (0-15 minutes)
- Stop application to prevent further corruption
- Identify corruption scope (affected tables, time range)
- Notify CEO and initiate incident response
Point-in-Time Recovery (15-60 minutes)
- Identify last known good state (before corruption)
- Restore RDS instance to point-in-time
- Validate data integrity in restored instance
Application Cutover (60-90 minutes)
- Update application configuration to new database endpoint
- Restart application services
- Verify application functionality
Post-Recovery (90+ minutes)
- Conduct root cause analysis
- Update monitoring to detect similar issues
- Document lessons learned
Expected RTO: 2 hours
Expected RPO: <15 minutes (point-in-time recovery)
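The point-in-time step of this procedure comes down to choosing a restore timestamp just before the corruption began. The sketch below builds the keyword arguments one might pass to boto3's `restore_db_instance_to_point_in_time`; the helper name `pitr_request`, the one-minute safety margin, and the target naming scheme are assumptions for illustration, and the sketch does not check the earliest restorable time.

```python
from datetime import datetime, timedelta, timezone


def pitr_request(source_instance: str,
                 corruption_start: datetime,
                 latest_restorable: datetime,
                 safety_margin: timedelta = timedelta(minutes=1)) -> dict:
    """Build restore arguments targeting the last known good state."""
    # Restore slightly before the first known-bad write, but never later
    # than the newest point RDS reports as restorable.
    restore_time = min(corruption_start - safety_margin, latest_restorable)
    return {
        "SourceDBInstanceIdentifier": source_instance,
        # Hypothetical naming scheme for the recovered instance.
        "TargetDBInstanceIdentifier": f"{source_instance}-pitr-{restore_time:%Y%m%d-%H%M}",
        "RestoreTime": restore_time,
    }
```

Restoring to a new instance, rather than over the original, keeps the corrupted database available for the root cause analysis listed under Post-Recovery.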
Scenario 2: Complete AWS Region Failure
Trigger: AWS region unavailable, all services unreachable.
Recovery Procedure:
Immediate Actions (0-30 minutes)
- Declare disaster, activate business continuity plan
- Notify stakeholders (users, CEO, partners)
- Initiate cross-region recovery
Database Recovery (30-120 minutes)
- Restore latest cross-region snapshot in alternate region
- Configure network security groups and access
- Validate data integrity
Application Deployment (120-180 minutes)
- Deploy application to alternate region (CloudFormation)
- Configure DNS cutover to new region
- Update monitoring and alerting
Validation (180-240 minutes)
- End-to-end application testing
- Performance validation
- User communication and documentation
Expected RTO: 4 hours
Expected RPO: <4 hours (cross-region snapshot lag)
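Restoring in an alternate region presupposes that snapshots have already been copied there before the outage. As a sketch of that preparatory step, the function below builds keyword arguments in the shape accepted by boto3's `copy_db_snapshot`, to be called on an RDS client created in the destination region. The helper name, the `-dr` suffix, and the account, region, and key values in the usage example are all hypothetical.

```python
def cross_region_copy_request(snapshot_id: str,
                              account_id: str,
                              source_region: str,
                              destination_kms_key: str) -> dict:
    """Build arguments for copying an encrypted snapshot into another region."""
    # Cross-region copies reference the source snapshot by ARN.
    source_arn = f"arn:aws:rds:{source_region}:{account_id}:snapshot:{snapshot_id}"
    # Automated snapshots are named "rds:<name>"; a target identifier cannot
    # contain a colon, so keep only the part after the last one.
    base_name = snapshot_id.split(":")[-1]
    return {
        "SourceDBSnapshotIdentifier": source_arn,
        "TargetDBSnapshotIdentifier": f"{base_name}-dr",
        "SourceRegion": source_region,
        # Encrypted snapshots need a KMS key that exists in the destination region.
        "KmsKeyId": destination_kms_key,
        "CopyTags": True,
    }
```

How often this copy runs determines the cross-region snapshot lag, and therefore whether the stated RPO of under 4 hours is achievable.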
ISO 27001 Control Mapping
A.17.1.2 - Implementing Information Security Continuity
Control Objective: Organization shall establish, document, implement and maintain processes, procedures and controls to ensure the required level of continuity for information security during an adverse situation.
Implementation:
- ✅ Backup strategies documented by classification
- ✅ RTO/RPO targets defined and tested
- ✅ Disaster recovery procedures established
- ✅ Quarterly restore testing
A.17.1.3 - Verify, Review and Evaluate Information Security Continuity
Control Objective: Organization shall verify established and implemented information security continuity controls at regular intervals.
Implementation:
- ✅ Quarterly backup restore tests
- ✅ Automated backup validation
- ✅ Annual disaster recovery exercise
- ✅ Continuous monitoring of backup status
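Whether the quarterly cadence is actually being kept can be checked from the date of the last documented restore test. This is a minimal sketch: `restore_test_overdue` is a hypothetical helper, and the 92-day threshold is an assumption chosen as the length of the longest calendar quarter.

```python
from datetime import date


def restore_test_overdue(last_test: date, today: date, max_interval_days: int = 92) -> bool:
    """Return True when more than one quarter has passed since the last restore test."""
    return (today - last_test).days > max_interval_days
```

A check like this fits naturally into the backup monitoring mentioned above, raising an alert when a quarter passes without a recorded test.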
NIST Cybersecurity Framework Mapping
PR.IP-4: Backups of information conducted, maintained, tested
- ✅ Automated backups per classification
- ✅ Versioned, point-in-time backups (S3 versioning, RDS snapshots); strict immutability additionally requires S3 Object Lock or AWS Backup Vault Lock
- ✅ Quarterly restore testing
RC.RP-1: Recovery plan executed during or after incident
- ✅ Disaster recovery procedures documented
- ✅ RTO/RPO targets defined
- ✅ Recovery scenarios tested
CIS Controls Mapping
CIS Control 11: Data Recovery
- 11.1: Establish and Maintain Data Recovery Process - ✅ Documented procedures
- 11.2: Perform Automated Backups - ✅ AWS automated snapshots
- 11.3: Protect Recovery Data - ✅ Encryption (KMS), S3 versioning
- 11.4: Establish and Maintain Isolated Instance of Recovery Data - ✅ Cross-region backups
- 11.5: Test Data Recovery - ✅ Quarterly restore tests
Practical Implementation Checklist
For New Systems
- Classify data per classification framework
- Define RTO/RPO targets based on business impact
- Configure automated backups (RDS, S3, EBS)
- Enable cross-region replication for critical data
- Set up backup monitoring and alerting
- Document recovery procedures
- Schedule first restore test (within 30 days)
For Existing Systems
- Audit current backup coverage
- Verify RTO/RPO alignment with business needs
- Test actual recovery time vs targets
- Implement missing backups
- Enable backup encryption
- Configure lifecycle policies for retention
- Conduct quarterly restore testing
Related Policies
- Backup and Recovery Policy - Detailed backup requirements
- Business Continuity Plan - BCP integration
- Classification Framework - Data classification and RTO/RPO tiers
- Information Security Policy - Overall security framework
References
- ISO/IEC 27001:2013 - Annex A.17 Information Security Aspects of Business Continuity Management (the 2022 edition renumbers these controls, e.g. A.5.29, A.5.30, and A.8.14)
- NIST SP 800-34 - Contingency Planning Guide
- CIS Controls v8 - Control 11: Data Recovery
- AWS Well-Architected Framework - Reliability Pillar