AWS DR Strategies: Complete Guide from Backup & Restore to Active-Active
Minimize cost with Backup & Restore, achieve instant recovery with Active-Active. Compare RTO/RPO and costs of 4 disaster recovery strategies for SAA-C03 exam.
Related Exam Domains
- Domain 2: Design Resilient Architectures
Key Takeaway
If cost is the priority, choose Backup & Restore (RTO up to 24 hours). For fast recovery, choose Warm Standby (RTO in minutes). For zero downtime, choose Active-Active (RTO ~0). Choosing a DR strategy is a tradeoff between RTO/RPO requirements and cost.
Exam Tip
Exam Essential: Backup & Restore (lowest cost, longest RTO) → Pilot Light (core only running) → Warm Standby (scaled-down) → Active-Active (zero downtime, highest cost)
DR Strategies at a Glance
| Strategy | RPO | RTO | Cost | Complexity | Use Case |
|---|---|---|---|---|---|
| Backup & Restore | Hours | Under 24 hours | Lowest | Low | Non-critical systems |
| Pilot Light | Minutes | Tens of minutes | Low | Medium | Important systems |
| Warm Standby | Minutes | Minutes | Medium | Medium | Core systems |
| Active-Active | ~0 | ~0 | Highest | High | Mission-critical |
DR Strategy Spectrum:
Low cost, long RTO ←────────────────────────────────────→ High cost, short RTO
Backup & Restore (24 hours) → Pilot Light (tens of minutes) → Warm Standby (minutes) → Active-Active (~0)
1. Backup & Restore
What is Backup & Restore?
A strategy where data is backed up regularly and, during a disaster, new infrastructure is created and the data is restored from backup. It is the most cost-effective strategy but has the longest RTO.
Backup & Restore Architecture:
Normal Operation:
Primary Region Recovery Region
┌─────────────┐ ┌─────────────┐
│ EC2, RDS │ │ │
│ Running │ ─── Backup ──→│ S3 backup │
│ │ │ only │
└─────────────┘ └─────────────┘
During Disaster:
Primary Region Recovery Region
┌─────────────┐ ┌─────────────┐
│ ✕ │ │ New infra │
│ Failed │ │ Create & │
│ │ │ Restore │
└─────────────┘ └─────────────┘
Core Components
| Component | Description |
|---|---|
| S3 Cross-Region Replication | Auto-replicate backup data |
| EBS Snapshots | Volume backups, copy to other regions |
| RDS Automated Backup | Snapshots + transaction logs |
| AWS Backup | Centralized backup management |
| Infrastructure as Code | CloudFormation, Terraform |
Recovery Process
Recovery Steps During Disaster:
1. Detect disaster and declare DR
↓
2. Provision infrastructure with IaC (VPC, EC2, RDS, etc.)
↓
3. Restore data from latest backup
↓
4. Deploy and configure applications
↓
5. Update DNS (Route 53)
↓
6. Verify and resume service
Total time: Several hours to 24 hours
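Because the recovery steps above run one after another, a rough RTO estimate is simply the sum of their durations. The sketch below illustrates that arithmetic; every duration is a hypothetical planning figure chosen for illustration, not a measured or AWS-published value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryStep:
    name: str
    estimated_minutes: int  # hypothetical planning figure, not a measurement


# Steps mirror the recovery process above; all durations are assumptions.
RUNBOOK = [
    RecoveryStep("Detect disaster and declare DR", 30),
    RecoveryStep("Provision infrastructure with IaC", 120),
    RecoveryStep("Restore data from latest backup", 240),
    RecoveryStep("Deploy and configure applications", 90),
    RecoveryStep("Update DNS (Route 53)", 10),
    RecoveryStep("Verify and resume service", 60),
]


def estimated_rto_minutes(steps):
    """Steps are sequential, so the RTO estimate is the sum of durations."""
    return sum(step.estimated_minutes for step in steps)


def slowest_step(steps):
    """Return the step that contributes most to the estimated RTO."""
    return max(steps, key=lambda step: step.estimated_minutes)


total = estimated_rto_minutes(RUNBOOK)
print(f"Estimated RTO: {total} min ({total / 60:.1f} h)")
print(f"Largest contributor: {slowest_step(RUNBOOK).name}")
```

With these assumed numbers the data restore dominates, which is why automating provisioning and restore tends to be the first place to look when shortening a Backup & Restore RTO.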
Suitable Use Cases
- Non-critical systems: Internal tools, dev environments
- Cost priority: Limited budget
- Relaxed RTO requirement: Can accept up to 24 hours of downtime
- Compliance: Only data backup required
Exam Tip
Exam Point: Backup & Restore is the answer when the question combines the keywords cost minimization + an RTO of up to 24 hours
2. Pilot Light
What is Pilot Light?
A strategy where only the core infrastructure runs, at minimal size. The database stays synchronized, and compute resources start only during a disaster.
Pilot Light Architecture:
Normal Operation:
Primary Region Recovery Region
┌─────────────┐ ┌─────────────┐
│ EC2 (Running)│ │ EC2 (Stopped)│
│ RDS Primary │ ─── Repl. ──→ │ RDS Replica │
│ (Read/Write)│ (Async) │ (Read Only) │
└─────────────┘ └─────────────┘
During Disaster:
Primary Region Recovery Region
┌─────────────┐ ┌─────────────┐
│ ✕ │ │ EC2 Start │
│ Failed │ │ RDS Promote │
│ │ │ (to Primary)│
└─────────────┘ └─────────────┘
Core Components
| Component | Normal State | Disaster Action |
|---|---|---|
| Database | Replica running | Promote to Primary |
| EC2 Instances | Stopped or not created | Start or create |
| AMI | Keep updated | Used to start EC2 |
| Network | VPC, subnets configured | Ready to use |
Difference from Backup & Restore
Key Differences:
Backup & Restore:
├── Data: Only backups stored in S3
├── Infrastructure: None (created during disaster)
└── RTO: 24 hours (infra creation + restore)
Pilot Light:
├── Data: DB replica synced in near real time (asynchronous replication)
├── Infrastructure: Core only standby (DB, network)
└── RTO: Tens of minutes (EC2 start + DB promote)
Suitable Use Cases
- Important business systems: Need recovery in tens of minutes
- Cost/recovery balance: Cheaper than Warm Standby
- Planned failover procedures: Recovery steps (start EC2, promote DB) can be documented and rehearsed in advance
Exam Tip
Exam Point: Pilot Light = only core systems running + RTO tens of minutes + DB replica maintained
3. Warm Standby
What is Warm Standby?
A strategy where a scaled-down version of the full environment runs in the recovery region. All components are running but at minimum capacity.
Warm Standby Architecture:
Normal Operation:
Primary Region Recovery Region
┌─────────────┐ ┌─────────────┐
│ EC2: 10 │ │ EC2: 2 │
│ (Full cap) │ │ (Min cap) │
│ │ │ │
│ RDS Primary │ ─── Repl. ──→ │ RDS Replica │
│ │ │ │
│ 100% Traffic│ │ 0% Traffic │
└─────────────┘ └─────────────┘
During Disaster:
Primary Region Recovery Region
┌─────────────┐ ┌─────────────┐
│ ✕ │ │ EC2: 10 │
│ Failed │ │ (Scale up) │
│ │ │ │
│ │ │ RDS Primary │
│ │ │ (Promoted) │
│ │ │ 100% Traffic│
└─────────────┘ └─────────────┘
Pilot Light vs Warm Standby
| Comparison | Pilot Light | Warm Standby |
|---|---|---|
| EC2 State | Stopped/not created | Running (minimum) |
| Recovery Action | Start + deploy | Scale up only |
| RTO | Tens of minutes | Minutes |
| Cost | Low | Medium |
| Testing | Complex | Easy (already running) |
Core Components
| Component | Normal | Disaster |
|---|---|---|
| Auto Scaling | Min capacity (e.g., 2) | Desired capacity (e.g., 10) |
| RDS | Read Replica | Promote to Primary |
| ELB | Active (no production traffic) | Full traffic |
| Route 53 | Weight 0 (0% of traffic) | Weight raised (100% of traffic) |
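The Route 53 row above can be sketched as a weighted record change. Route 53 weights are integers from 0 to 255 rather than percentages, and traffic is split in proportion to them. The sketch below only builds the `ChangeBatch` dictionary in the shape accepted by the Route 53 `ChangeResourceRecordSets` API (for example through boto3's `change_resource_record_sets`); it makes no AWS call. The helper name `weighted_change_batch` and all hostnames are hypothetical.

```python
def weighted_change_batch(record_name, targets, ttl=60):
    """Build a ChangeBatch dict that UPSERTs one weighted CNAME per region.

    targets maps a region label to (dns_name, weight). Route 53 weights are
    integers from 0 to 255; traffic is split in proportion to the weights.
    """
    changes = []
    for region, (dns_name, weight) in targets.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {region} must be 0-255, got {weight}")
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": region,  # distinguishes the weighted records
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": dns_name}],
            },
        })
    return {"Comment": "DR traffic shift", "Changes": changes}


# Hypothetical load balancer hostnames, for illustration only.
PRIMARY = "primary-alb.example.com"
RECOVERY = "recovery-alb.example.com"

normal = weighted_change_batch(
    "app.example.com", {"primary": (PRIMARY, 100), "recovery": (RECOVERY, 0)}
)
failover = weighted_change_batch(
    "app.example.com", {"primary": (PRIMARY, 0), "recovery": (RECOVERY, 100)}
)
print(failover["Changes"][1]["ResourceRecordSet"]["Weight"])
```

A short TTL is assumed here so that resolvers pick up the shifted weights quickly; the right value depends on the workload.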
Suitable Use Cases
- Core business systems: Need recovery in minutes
- Regular testing needed: Verify DR environment periodically
- Budget available: Higher cost than Pilot Light acceptable
Exam Tip
Exam Point: Warm Standby = scaled-down environment running + only scale up needed + RTO minutes
4. Active-Active (Multi-Site)
What is Active-Active?
A strategy where multiple regions handle traffic simultaneously. When one region fails, others immediately handle the full load.
Active-Active Architecture:
Normal Operation:
Route 53 (50% / 50% distribution)
│
┌─────────┴─────────┐
↓ ↓
Primary Region Recovery Region
┌─────────────┐ ┌─────────────┐
│ EC2 (Full) │ │ EC2 (Full) │
│ 50% Traffic │ │ 50% Traffic │
│ │ │ │
│ DynamoDB │←──→│ DynamoDB │
│ Global Table│Repl│ Global Table│
└─────────────┘ └─────────────┘
During Disaster:
Route 53 (0% / 100% distribution)
│
┌─────────┴─────────┐
↓ ↓
Primary Region Recovery Region
┌─────────────┐ ┌─────────────┐
│ ✕ │ │ EC2 (Full) │
│ Failed │ │ 100% Traffic│
│ │ │ │
│ │ │ DynamoDB │
│ │ │ (Continues) │
└─────────────┘ └─────────────┘
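For the surviving region to absorb 100% of traffic immediately, each region has to be sized for the load it would carry after a failure, not for its normal share. A minimal sketch of that sizing arithmetic follows; the function name and the peak of 10 instances are assumptions for illustration.

```python
import math


def instances_per_region(peak_instances, regions, tolerated_failures=1):
    """Instances each region needs so the survivors can carry the full peak."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return math.ceil(peak_instances / survivors)


# Hypothetical peak load that needs 10 instances in total.
for region_count in (2, 3):
    per_region = instances_per_region(10, region_count)
    print(region_count, "regions:", per_region, "each,", per_region * region_count, "total")
```

With two regions each side must hold the full 10 instances, which is where the roughly 2x infrastructure cost of Active-Active comes from. Spreading across three regions lowers the total under the same assumptions.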
Core Components
| Component | Description |
|---|---|
| Route 53 | Latency/weighted-based routing |
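| DynamoDB Global Tables | Multi-Region, multi-active (every Region accepts writes) |
| Aurora Global Database | Typically sub-second cross-Region replication; one writer Region, read-only secondary Regions |
| S3 Cross-Region Replication | Two-way replication (rules configured in both directions) |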
| Global Accelerator | Global traffic distribution |
Data Synchronization Considerations
Active-Active Data Sync Challenges:
1. Write Conflicts
- Same record modified in two regions simultaneously
- Solution: Last Writer Wins, or conditional writes based on a version number
2. Consistency
- Replication delay between regions
- Solution: Accept eventual consistency, pin read region
3. Transactions
- Distributed transactions difficult
- Solution: Use transactions within region only
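The Last Writer Wins approach to write conflicts can be illustrated in a few lines. This is a simplified sketch of the idea only, not a reproduction of how DynamoDB Global Tables implement it internally; the `VersionedItem` type and the region-name tie-break are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedItem:
    key: str
    value: str
    updated_at_ms: int  # write timestamp in milliseconds
    region: str         # region that accepted the write


def last_writer_wins(a, b):
    """Keep the version with the later timestamp.

    Ties are broken by region name (an assumption) so that every region
    resolves the same conflict to the same winner.
    """
    if a.key != b.key:
        raise ValueError("can only reconcile versions of the same key")
    return max(a, b, key=lambda item: (item.updated_at_ms, item.region))


tokyo = VersionedItem("user#1", "plan=basic", 1_700_000_000_100, "ap-northeast-1")
virginia = VersionedItem("user#1", "plan=pro", 1_700_000_000_250, "us-east-1")
print(last_writer_wins(tokyo, virginia).value)
```

Note that the earlier write is silently discarded, which is why the text also suggests limiting conflicting writes to a single region when lost updates are unacceptable.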
Suitable Use Cases
- Mission-critical systems: Zero downtime tolerance
- Global services: Serve users from nearest region
- Regulatory requirements: Near-zero RTO/RPO mandatory
- Sufficient budget: 2x infrastructure cost
Exam Tip
Exam Point: Active-Active = RTO/RPO ~0 + both sides handle traffic + data conflict management needed
Strategy Selection Guide
Selection by Requirements
DR Strategy Decision Tree:
RTO requirement?
│
├── Up to 24 hours acceptable → Backup & Restore
│
├── Tens of minutes needed → Pilot Light
│
├── Minutes needed → Warm Standby
│
└── Zero downtime → Active-Active
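The decision tree can be expressed as "pick the cheapest strategy whose typical RTO and RPO both fit the requirement". The sketch below does that. The numeric values are illustrative assumptions that stand in for the qualitative ranges in this guide (for example, "hours" of RPO is modeled as 240 minutes); real figures depend on the workload.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    rto_minutes: float  # assumed typical recovery time
    rpo_minutes: float  # assumed typical data-loss window


# Ordered from cheapest to most expensive; all numbers are assumptions.
STRATEGIES = [
    Strategy("Backup & Restore", 24 * 60, 240),
    Strategy("Pilot Light", 30, 5),
    Strategy("Warm Standby", 5, 5),
    Strategy("Active-Active", 0, 0),
]


def choose_strategy(required_rto_minutes, required_rpo_minutes):
    """Return the cheapest strategy that meets both requirements."""
    for strategy in STRATEGIES:
        if (strategy.rto_minutes <= required_rto_minutes
                and strategy.rpo_minutes <= required_rpo_minutes):
            return strategy.name
    raise ValueError("requirements must be non-negative")


print(choose_strategy(60, 15))
```

Because the list is ordered by cost, the first match is also the cheapest one, which mirrors how "minimize cost while meeting requirements" exam questions are usually framed.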
Cost vs Recovery Time Analysis
Monthly Cost Example (base environment $10,000/month):
Backup & Restore:
├── Additional cost: ~$500/month (S3 backup + snapshots)
└── RTO: 24 hours
Pilot Light:
├── Additional cost: ~$2,000/month (DB replica + min infra)
└── RTO: 30 minutes
Warm Standby:
├── Additional cost: ~$5,000/month (scaled-down always running)
└── RTO: 5 minutes
Active-Active:
├── Additional cost: ~$10,000/month (full environment 2x)
└── RTO: ~0
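The figures in this example can be turned into two derived numbers: the DR overhead as a percentage of the base environment, and the extra monthly cost of each minute of RTO removed when stepping up to the next strategy. The sketch uses only the example values above; the function names are hypothetical.

```python
BASE_MONTHLY_COST = 10_000

# (strategy, additional monthly cost in USD, RTO in minutes) from the example.
OPTIONS = [
    ("Backup & Restore", 500, 24 * 60),
    ("Pilot Light", 2_000, 30),
    ("Warm Standby", 5_000, 5),
    ("Active-Active", 10_000, 0),
]


def overhead_percent(additional_cost, base=BASE_MONTHLY_COST):
    """DR spend as a percentage of the base environment cost."""
    return 100 * additional_cost / base


def marginal_cost_per_minute(options):
    """Extra USD per month for each RTO minute saved by the next strategy."""
    steps = {}
    for (_, cost_a, rto_a), (name_b, cost_b, rto_b) in zip(options, options[1:]):
        steps[name_b] = (cost_b - cost_a) / (rto_a - rto_b)
    return steps


for name, extra, rto in OPTIONS:
    total = BASE_MONTHLY_COST + extra
    print(f"{name}: total ${total:,}, overhead {overhead_percent(extra):.0f}%, RTO {rto} min")

for name, usd in marginal_cost_per_minute(OPTIONS).items():
    print(f"Stepping up to {name}: about ${usd:,.2f} per RTO minute saved")
```

The marginal cost rises sharply toward Active-Active, which is the tradeoff the cost order Backup < Pilot Light < Warm Standby < Active-Active summarizes.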
AWS Services for DR
AWS Services by Strategy
| Strategy | Compute | Database | Storage | Networking |
|---|---|---|---|---|
| Backup | AMI | RDS Snapshots | S3 CRR | - |
| Pilot Light | AMI (standby) | RDS Read Replica | S3 CRR | VPC configured |
| Warm Standby | ASG (min) | RDS Read Replica | S3 CRR | ELB |
| Active-Active | ASG (full) | Aurora Global, DynamoDB Global | S3 CRR | Route 53 |
AWS Elastic Disaster Recovery
AWS Elastic Disaster Recovery (DRS):
Features:
├── Continuous block-level replication
├── RPO of seconds
├── Minutes RTO
├── Pilot Light cost for Warm Standby-level recovery
└── Automated launch of recovery instances, plus failback support
Suitable for:
├── On-premises → AWS DR
├── AWS region-to-region DR
└── Cost-effective DR needed
SAA-C03 Exam Focus Points
- ✅ RTO/RPO by strategy: Backup (hours), Pilot Light (tens of min), Warm Standby (min), Active-Active (~0)
- ✅ Cost order: Backup < Pilot Light < Warm Standby < Active-Active
- ✅ Pilot Light vs Warm Standby: Core only running vs scaled-down full environment
- ✅ Active-Active data: DynamoDB Global Tables, Aurora Global Database
- ✅ Route 53 role: Failover routing, health checks
- ✅ AWS DRS: Pilot Light cost for low RTO
Exam Tip
Sample Exam Question: "A company has DR requirements of RTO 1 hour, RPO 15 minutes. What DR strategy minimizes cost while meeting requirements?" → Answer: Pilot Light (RTO tens of minutes, RPO minutes, cheaper than Warm Standby)
Frequently Asked Questions
Q: What's the biggest difference between Pilot Light and Warm Standby?
Whether EC2 is running. Pilot Light runs only the database, with EC2 stopped or not yet created. Warm Standby runs all components at minimum capacity. As a result, Warm Standby has a shorter RTO but a higher cost.
Q: How are data conflicts resolved in Active-Active?
Last Writer Wins or conflict resolution logic is needed. DynamoDB Global Tables use timestamp-based last writer selection. At application level, limiting conflicting writes to specific regions is also an approach.
Q: How do you test DR strategies?
Conduct regular DR drills. Simulate actual failover and verify RTO/RPO goals are met. Warm Standby and Active-Active are easier to test as they're already running.
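Checking a drill against its goals comes down to two subtractions: achieved RTO is the time from the outage to restored service, and achieved RPO is the time from the last replicated write to the outage. The sketch below shows that check; the timestamps and targets are hypothetical drill values.

```python
from datetime import datetime, timedelta


def measure_drill(last_replicated_write, outage_start, service_restored):
    """Return (achieved_rto, achieved_rpo) as timedeltas for one drill."""
    achieved_rto = service_restored - outage_start
    achieved_rpo = outage_start - last_replicated_write
    return achieved_rto, achieved_rpo


def meets_targets(achieved_rto, achieved_rpo, target_rto, target_rpo):
    """True only when both the RTO and the RPO goal are met."""
    return achieved_rto <= target_rto and achieved_rpo <= target_rpo


# Hypothetical timestamps taken from a drill log.
outage = datetime(2024, 1, 15, 10, 0, 0)
rto, rpo = measure_drill(
    last_replicated_write=outage - timedelta(minutes=4),
    outage_start=outage,
    service_restored=outage + timedelta(minutes=42),
)
ok = meets_targets(rto, rpo, timedelta(hours=1), timedelta(minutes=15))
print(f"RTO {rto}, RPO {rpo}, targets met: {ok}")
```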
Q: How can RTO be reduced in Backup & Restore?
Use automation and IaC. Codify infrastructure with CloudFormation/Terraform, configure automated backups with AWS Backup, and automate recovery procedures to reduce RTO to several hours.
Q: Are hybrid DR strategies possible?
Yes, different strategies can be applied per workload. Core systems can use Warm Standby while secondary systems use Backup & Restore to balance cost and recovery time.
Q: What about data transfer costs for multi-region DR?
Cross-region data transfer incurs costs. S3 CRR, RDS cross-region replication, DynamoDB Global Tables all charge for inter-region data transfer. Active-Active has highest costs due to bidirectional sync.