AWS DR Strategies: Complete Guide from Backup & Restore to Active-Active

Key Takeaway

If cost is the priority, choose Backup & Restore (RTO up to 24 hours). For a low-cost middle ground, choose Pilot Light (RTO in tens of minutes). For fast recovery, choose Warm Standby (RTO in minutes). For near-zero downtime, choose Active-Active (RTO ~0). Choosing a DR strategy is a tradeoff between RTO/RPO requirements and cost.

Exam Tip

Exam Essential: Backup & Restore (lowest cost, longest RTO) → Pilot Light (core only running) → Warm Standby (scaled-down) → Active-Active (zero downtime, highest cost)

DR Strategies at a Glance

Strategy	RPO	RTO	Cost	Complexity	Use Case
Backup & Restore	Hours	Under 24 hours	Lowest	Low	Non-critical systems
Pilot Light	Minutes	Tens of minutes	Low	Medium	Important systems
Warm Standby	Minutes	Minutes	Medium	Medium	Core systems
Active-Active	~0	~0	Highest	High	Mission-critical

DR Strategy Spectrum:

Low Cost ←────────────────────────────────────→ High Cost
Long RTO ←────────────────────────────────────→ Short RTO

Backup &     Pilot        Warm          Active-
Restore      Light        Standby       Active
   │           │            │              │
   ↓           ↓            ↓              ↓
 24 hours   Tens of min    Minutes        ~0

1. Backup & Restore

What is Backup & Restore?

A strategy where data is backed up regularly and, during a disaster, new infrastructure is created in the recovery region and the data is restored from backup. It is the most cost-effective option but has the longest RTO.

Backup & Restore Architecture:

Normal Operation:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│ EC2, RDS    │               │             │
│ Running     │ ─── Backup ──→│ S3 backup   │
│             │               │ only        │
└─────────────┘               └─────────────┘

During Disaster:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│     ✕       │               │ New infra   │
│   Failed    │               │ Create &    │
│             │               │ Restore     │
└─────────────┘               └─────────────┘

Core Components

Component	Description
S3 Cross-Region Replication	Auto-replicate backup data
EBS Snapshots	Volume backups, copy to other regions
RDS Automated Backup	Snapshots + transaction logs
AWS Backup	Centralized backup management
Infrastructure as Code	CloudFormation, Terraform
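
As an illustration of the "copy to other regions" step for EBS snapshots, here is a minimal sketch. It assumes a boto3-style EC2 client created for the recovery region is passed in; the function name and the example IDs are hypothetical. `copy_snapshot` is called on the destination-region client and is told where the source snapshot lives.

```python
from typing import Any


def copy_snapshot_to_recovery_region(
    recovery_ec2: Any,
    snapshot_id: str,
    source_region: str,
) -> str:
    """Copy an EBS snapshot into the recovery region and return the new snapshot ID.

    `recovery_ec2` is assumed to be an EC2 client bound to the *recovery*
    region, for example boto3.client("ec2", region_name="us-west-2").
    """
    response = recovery_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    # The copy completes asynchronously; the ID is returned immediately.
    return response["SnapshotId"]
```

In practice, AWS Backup copy rules can schedule the same cross-region copy without custom code; the sketch only shows what the underlying call looks like.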

Recovery Process

Recovery Steps During Disaster:

1. Detect disaster and declare DR
   ↓
2. Provision infrastructure with IaC (VPC, EC2, RDS, etc.)
   ↓
3. Restore data from latest backup
   ↓
4. Deploy and configure applications
   ↓
5. Update DNS (Route 53)
   ↓
6. Verify and resume service

Total time: Several hours to 24 hours
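
Because the recovery steps run one after another, the achievable RTO is roughly the sum of the step durations. The sketch below makes that arithmetic explicit. The step names come from the list above, but every duration is an illustrative guess, not a measured value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryStep:
    name: str
    estimated_minutes: int


# Durations are hypothetical; replace them with timings from your own DR drills.
RUNBOOK = [
    RecoveryStep("Detect disaster and declare DR", 30),
    RecoveryStep("Provision infrastructure with IaC", 90),
    RecoveryStep("Restore data from latest backup", 240),
    RecoveryStep("Deploy and configure applications", 60),
    RecoveryStep("Update DNS (Route 53)", 15),
    RecoveryStep("Verify and resume service", 45),
]


def estimated_rto_hours(steps):
    """Steps are sequential, so the estimated RTO is the sum of their durations."""
    return sum(step.estimated_minutes for step in steps) / 60


def slowest_step(steps):
    """The longest step is the first candidate for optimization."""
    return max(steps, key=lambda step: step.estimated_minutes)


print(estimated_rto_hours(RUNBOOK))  # 8.0
print(slowest_step(RUNBOOK).name)    # Restore data from latest backup
```

With these assumed numbers the data restore dominates, which is why restore speed is usually the first thing to improve when shortening a Backup & Restore RTO.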

Suitable Use Cases

Non-critical systems: Internal tools, dev environments
Cost priority: Limited budget
Relaxed RTO requirement: Can accept up to 24 hours of downtime
Compliance: Only data backup required

Exam Tip

Exam Point: Backup & Restore is usually the answer when a question combines cost minimization with an acceptable RTO measured in hours (up to about 24 hours)

2. Pilot Light

What is Pilot Light?

A strategy where only the core infrastructure runs, at minimal size, in the recovery region. The database stays synchronized through replication, while compute resources are started only during a disaster.

Pilot Light Architecture:

Normal Operation:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│ EC2 (Running)│               │ EC2 (Stopped)│
│ RDS Primary │ ─── Repl. ──→ │ RDS Replica │
│ (Read/Write)│   (Async)     │ (Read Only) │
└─────────────┘               └─────────────┘

During Disaster:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│     ✕       │               │ EC2 Start   │
│   Failed    │               │ RDS Promote │
│             │               │ (to Primary)│
└─────────────┘               └─────────────┘

Core Components

Component	Normal State	Disaster Action
Database	Replica running	Promote to Primary
EC2 Instances	Stopped or not created	Start or create
AMI	Keep updated	Used to start EC2
Network	VPC, subnets configured	Ready to use
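
The two disaster actions in the table (promote the database, start the compute) can be sketched as follows. This is a minimal sketch, assuming boto3-style RDS and EC2 clients for the recovery region are passed in; the function name and identifiers are hypothetical.

```python
from typing import Any, Sequence


def fail_over_pilot_light(
    rds: Any,
    ec2: Any,
    replica_id: str,
    standby_instance_ids: Sequence[str],
) -> None:
    """Turn the pilot light into a serving environment.

    `rds` and `ec2` are assumed to be clients for the recovery region, e.g.
    boto3.client("rds", region_name=...) and boto3.client("ec2", region_name=...).
    """
    # Promote the read replica to a standalone, writable DB instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Start the pre-built application instances that were kept stopped.
    ec2.start_instances(InstanceIds=list(standby_instance_ids))
```

Both calls return before the resources are ready, so a real runbook would also wait for the database and instances to become available before updating DNS.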

Difference from Backup & Restore

Key Differences:

Backup & Restore:
├── Data: Only backups stored in S3
├── Infrastructure: None (created during disaster)
└── RTO: Up to 24 hours (infra creation + restore)

Pilot Light:
├── Data: DB replica kept in sync in near real-time (asynchronous replication)
├── Infrastructure: Core only standby (DB, network)
└── RTO: Tens of minutes (EC2 start + DB promote)

Suitable Use Cases

Important business systems: Need recovery in tens of minutes
Cost/recovery balance: Cheaper than Warm Standby
Planned recovery procedures: A documented, rehearsed runbook for starting compute and promoting the database is feasible

Exam Tip

Exam Point: Pilot Light = only core systems running + RTO tens of minutes + DB replica maintained

3. Warm Standby

What is Warm Standby?

A strategy where a scaled-down version of the full environment runs in the recovery region. All components are running but at minimum capacity.

Warm Standby Architecture:

Normal Operation:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│ EC2: 10     │               │ EC2: 2      │
│ (Full cap)  │               │ (Min cap)   │
│             │               │             │
│ RDS Primary │ ─── Repl. ──→ │ RDS Replica │
│             │               │             │
│ 100% Traffic│               │ 0% Traffic  │
└─────────────┘               └─────────────┘

During Disaster:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│     ✕       │               │ EC2: 10     │
│   Failed    │               │ (Scale up)  │
│             │               │             │
│             │               │ RDS Primary │
│             │               │ (Promoted)  │
│             │               │ 100% Traffic│
└─────────────┘               └─────────────┘

Pilot Light vs Warm Standby

Comparison	Pilot Light	Warm Standby
EC2 State	Stopped/not created	Running (minimum)
Recovery Action	Start + deploy	Scale up only
RTO	Tens of minutes	Minutes
Cost	Low	Medium
Testing	Complex	Easy (already running)

Core Components

Component	Normal	Disaster
Auto Scaling	Min capacity (e.g., 2)	Desired capacity (e.g., 10)
RDS	Read Replica	Promote to Primary
ELB	Active (health checks and test traffic only)	Full traffic
Route 53	Weight 0%	Weight 100%
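
A hedged sketch of the Warm Standby disaster actions from the table: raise the Auto Scaling group to full capacity, then shift the Route 53 weights. It assumes boto3-style clients are passed in, that the group's maximum size already allows the full capacity, and that weighted CNAME records are used (alias records are also common). All names and IDs are hypothetical. Note that Route 53 weights are integers from 0 to 255 rather than percentages, so "0% / 100%" maps to weight 0 on the primary record and any positive weight on the recovery record.

```python
from typing import Any, List


def weighted_cname_change(name: str, set_id: str, target: str, weight: int) -> dict:
    """Build one UPSERT change for a weighted CNAME record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Weight": weight,
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    }


def activate_warm_standby(
    autoscaling: Any,
    route53: Any,
    *,
    asg_name: str,
    full_capacity: int,
    hosted_zone_id: str,
    changes: List[dict],
) -> None:
    # Scale the already-running standby fleet from minimum to full capacity.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=full_capacity,
        HonorCooldown=False,
    )
    # Move traffic: weight 0 for the failed region, positive weight for recovery.
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={"Comment": "DR failover to recovery region", "Changes": changes},
    )
```

A caller would pass two changes, for example `weighted_cname_change("app.example.com", "primary", "primary-elb.example.com", 0)` and the same record for the recovery side with weight 100.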

Suitable Use Cases

Core business systems: Need recovery in minutes
Regular testing needed: Verify DR environment periodically
Budget available: Higher cost than Pilot Light acceptable

Exam Tip

Exam Point: Warm Standby = scaled-down environment running + only scale up needed + RTO minutes

4. Active-Active (Multi-Site)

What is Active-Active?

A strategy where multiple regions handle traffic simultaneously. When one region fails, others immediately handle the full load.

Active-Active Architecture:

Normal Operation:
      Route 53 (50% / 50% distribution)
              │
    ┌─────────┴─────────┐
    ↓                   ↓
Primary Region         Recovery Region
┌─────────────┐    ┌─────────────┐
│ EC2 (Full)  │    │ EC2 (Full)  │
│ 50% Traffic │    │ 50% Traffic │
│             │    │             │
│ DynamoDB    │←──→│ DynamoDB    │
│ Global Table│Repl│ Global Table│
└─────────────┘    └─────────────┘

During Disaster:
      Route 53 (0% / 100% distribution)
              │
    ┌─────────┴─────────┐
    ↓                   ↓
Primary Region         Recovery Region
┌─────────────┐    ┌─────────────┐
│     ✕       │    │ EC2 (Full)  │
│   Failed    │    │ 100% Traffic│
│             │    │             │
│             │    │ DynamoDB    │
│             │    │ (Continues) │
└─────────────┘    └─────────────┘

Core Components

Component	Description
Route 53	Latency/weighted-based routing
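DynamoDB Global Tables	Multi-region, multi-active replication (writes accepted in every region)
Aurora Global Database	Cross-region replication with typical lag under one second (one writer region, read-only secondary regions)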
S3 Cross-Region Replication	Bidirectional replication
Global Accelerator	Global traffic distribution

Data Synchronization Considerations

Active-Active Data Sync Challenges:

1. Write Conflicts
   - Same record modified in two regions simultaneously
   - Solution: Last-writer-wins resolution (as used by DynamoDB Global Tables), or optimistic locking with version numbers

2. Consistency
   - Replication delay between regions
   - Solution: Accept eventual consistency, pin read region

3. Transactions
   - Distributed transactions difficult
   - Solution: Use transactions within region only
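
The two write-conflict approaches named above can be sketched in a few lines. This is an illustrative model only, not how any AWS service is implemented internally; the `Item` type, its field names, and the tie-break rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    key: str
    value: str
    updated_at_ms: int  # timestamp of the write
    region: str         # region that accepted the write


def last_writer_wins(a: Item, b: Item) -> Item:
    """Keep the version with the later timestamp; break exact ties by region name."""
    return max(a, b, key=lambda item: (item.updated_at_ms, item.region))


class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the record."""


def write_with_version_check(stored: dict, new_value: str, expected_version: int) -> dict:
    """Optimistic locking: accept the write only if nobody changed the record first."""
    if stored["version"] != expected_version:
        raise VersionConflict(
            f"expected version {expected_version}, found {stored['version']}"
        )
    return {"value": new_value, "version": expected_version + 1}
```

Last-writer-wins never rejects a write but silently discards the older one, whereas the version check surfaces the conflict to the caller, who must re-read and retry.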

Suitable Use Cases

Mission-critical systems: Zero downtime tolerance
Global services: Serve users from nearest region
Regulatory requirements: Near-zero RTO/RPO mandatory
Sufficient budget: 2x infrastructure cost

Exam Tip

Exam Point: Active-Active = RTO/RPO ~0 + both sides handle traffic + data conflict management needed

Strategy Selection Guide

Selection by Requirements

DR Strategy Decision Tree:

RTO requirement?
    │
    ├── Up to 24 hours acceptable → Backup & Restore
    │
    ├── Tens of minutes needed → Pilot Light
    │
    ├── Minutes needed → Warm Standby
    │
    └── Zero downtime → Active-Active
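
The decision tree amounts to picking the cheapest strategy whose typical RTO fits within the required RTO. Here is a minimal sketch of that rule. The typical RTO values (24 hours, 30 minutes, 5 minutes, 0) are this guide's illustrative figures, not guarantees, and the function name is hypothetical.

```python
# (strategy, typical RTO in minutes), ordered from cheapest to most expensive
STRATEGIES = [
    ("Backup & Restore", 24 * 60),
    ("Pilot Light", 30),
    ("Warm Standby", 5),
    ("Active-Active", 0),
]


def choose_dr_strategy(required_rto_minutes: float) -> str:
    """Return the cheapest strategy whose typical RTO meets the requirement."""
    for name, typical_rto in STRATEGIES:
        if typical_rto <= required_rto_minutes:
            return name
    raise ValueError("required RTO cannot be negative")


print(choose_dr_strategy(24 * 60))  # Backup & Restore
print(choose_dr_strategy(60))       # Pilot Light
print(choose_dr_strategy(10))       # Warm Standby
print(choose_dr_strategy(0))        # Active-Active
```

A fuller version would also check the RPO requirement, since a strategy can meet the RTO and still lose more data than the business accepts.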

Cost vs Recovery Time Analysis

Monthly Cost Example (base environment $10,000/month):

Backup & Restore:
├── Additional cost: ~$500/month (S3 backup + snapshots)
└── RTO: 24 hours

Pilot Light:
├── Additional cost: ~$2,000/month (DB replica + min infra)
└── RTO: 30 minutes

Warm Standby:
├── Additional cost: ~$5,000/month (scaled-down always running)
└── RTO: 5 minutes

Active-Active:
├── Additional cost: ~$10,000/month (full environment 2x)
└── RTO: ~0
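
Using only the example figures above, a short calculation shows how steeply the price of each additional minute of RTO reduction rises along the spectrum. The numbers are the illustrative ones from this section, not real pricing.

```python
BASE_MONTHLY_COST = 10_000  # USD, the example base environment

# (strategy, additional monthly cost in USD, RTO in minutes), cheapest first
OPTIONS = [
    ("Backup & Restore", 500, 24 * 60),
    ("Pilot Light", 2_000, 30),
    ("Warm Standby", 5_000, 5),
    ("Active-Active", 10_000, 0),
]


def overhead_percent(additional_cost: float) -> float:
    """DR spend as a percentage of the base environment cost."""
    return 100 * additional_cost / BASE_MONTHLY_COST


def marginal_cost_per_minute_saved(options):
    """Extra monthly spend per minute of RTO removed, for each step up the spectrum."""
    result = {}
    for (_, prev_cost, prev_rto), (name, cost, rto) in zip(options, options[1:]):
        result[name] = (cost - prev_cost) / (prev_rto - rto)
    return result


for name, cost, _ in OPTIONS:
    print(f"{name}: +{overhead_percent(cost):.0f}% of base cost")

for name, dollars in marginal_cost_per_minute_saved(OPTIONS).items():
    print(f"{name}: ${dollars:,.2f} per minute of RTO saved")
```

With these figures, moving from Backup & Restore to Pilot Light costs about $1.06 per minute of RTO saved, Pilot Light to Warm Standby costs $120 per minute, and Warm Standby to Active-Active costs $1,000 per minute.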

AWS Services for DR

AWS Services by Strategy

Strategy	Compute	Database	Storage	Networking
Backup & Restore	AMI	RDS Snapshots	S3 CRR	-
Pilot Light	AMI (standby)	RDS Read Replica	S3 CRR	VPC configured
Warm Standby	ASG (min)	RDS Read Replica	S3 CRR	ELB
Active-Active	ASG (full)	Aurora Global, DynamoDB Global	S3 CRR	Route 53

AWS Elastic Disaster Recovery

AWS Elastic Disaster Recovery (DRS):

Features:
├── Continuous block-level replication
├── RPO in seconds
├── RTO in minutes
├── Pilot Light cost for Warm Standby-level recovery
└── Automated launch of recovery instances (failover and failback are initiated by the operator)

Suitable for:
├── On-premises → AWS DR
├── AWS region-to-region DR
└── Cost-effective DR needed

SAA-C03 Exam Focus Points

✅ RTO by strategy: Backup & Restore (hours), Pilot Light (tens of minutes), Warm Standby (minutes), Active-Active (~0)
✅ Cost order: Backup & Restore < Pilot Light < Warm Standby < Active-Active
✅ Pilot Light vs Warm Standby: Core only running vs scaled-down full environment
✅ Active-Active data: DynamoDB Global Tables, Aurora Global Database
✅ Route 53 role: Failover routing, health checks
✅ AWS DRS: Pilot Light cost for low RTO

Exam Tip

Sample Exam Question: "A company has DR requirements of RTO 1 hour, RPO 15 minutes. What DR strategy minimizes cost while meeting requirements?" → Answer: Pilot Light (RTO tens of minutes, RPO minutes, cheaper than Warm Standby)
