
AWS DR Strategies: Complete Guide from Backup & Restore to Active-Active

Minimize cost with Backup & Restore, or achieve near-instant recovery with Active-Active. Compare the RTO/RPO and costs of four disaster recovery strategies for the SAA-C03 exam.

PHILOLAMB | Updated: January 31, 2026
Tags: DR, Disaster Recovery, Backup, Pilot Light, Warm Standby, Active-Active

Related Exam Domains

  • Domain 2: Design Resilient Architectures

Key Takeaway

If cost is the priority, choose Backup & Restore (RTO up to 24 hours). For fast recovery, choose Warm Standby (RTO in minutes). For near-zero downtime, choose Active-Active (RTO ~0). A DR strategy is a tradeoff between RTO/RPO requirements and cost.

Exam Tip

Exam Essential: Backup & Restore (lowest cost, longest RTO) → Pilot Light (core only running) → Warm Standby (scaled-down) → Active-Active (zero downtime, highest cost)

DR Strategies at a Glance

Strategy         | RPO     | RTO             | Cost    | Complexity | Use Case
Backup & Restore | Hours   | Under 24 hours  | Lowest  | Low        | Non-critical systems
Pilot Light      | Minutes | Tens of minutes | Low     | Medium     | Important systems
Warm Standby     | Minutes | Minutes         | Medium  | Medium     | Core systems
Active-Active    | ~0      | ~0              | Highest | High       | Mission-critical

DR Strategy Spectrum:

Low Cost ←────────────────────────────────────→ High Cost
Long RTO ←────────────────────────────────────→ Short RTO

Backup &     Pilot        Warm          Active-
Restore      Light        Standby       Active
   │           │            │              │
   ↓           ↓            ↓              ↓
 24 hours   Tens of min    Minutes        ~0

1. Backup & Restore

What is Backup & Restore?

A strategy where data is backed up regularly and, during a disaster, new infrastructure is created and the data is restored from backup. It is the most cost-effective strategy but has the longest RTO.

Backup & Restore Architecture:

Normal Operation:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│ EC2, RDS    │               │             │
│ Running     │ ─── Backup ──→│ S3 backup   │
│             │               │ only        │
└─────────────┘               └─────────────┘

During Disaster:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│     ✕       │               │ New infra   │
│   Failed    │               │ Create &    │
│             │               │ Restore     │
└─────────────┘               └─────────────┘

Core Components

Component                   | Description
S3 Cross-Region Replication | Auto-replicate backup data
EBS Snapshots               | Volume backups, copy to other regions
RDS Automated Backup        | Snapshots + transaction logs
AWS Backup                  | Centralized backup management
Infrastructure as Code      | CloudFormation, Terraform
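
As an illustration of the "copy to other regions" step for EBS snapshots, here is a minimal sketch. It assumes the caller passes in a boto3 EC2 client created for the recovery region (for example `boto3.client("ec2", region_name="us-west-2")`); the function name, snapshot ID, and region names are hypothetical.

```python
def copy_snapshot_to_recovery_region(ec2_recovery, snapshot_id, source_region):
    """Copy one EBS snapshot into the region that `ec2_recovery` is bound to.

    `ec2_recovery` is assumed to be a boto3 EC2 client for the recovery
    region; copy_snapshot is called on the destination side.
    """
    response = ec2_recovery.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    # The response carries the ID of the new snapshot in the recovery region.
    return response["SnapshotId"]
```

In practice AWS Backup can schedule cross-region copies without custom code; the sketch only shows what a single copy looks like.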

Recovery Process

Recovery Steps During Disaster:

1. Detect disaster and declare DR
   ↓
2. Provision infrastructure with IaC (VPC, EC2, RDS, etc.)
   ↓
3. Restore data from latest backup
   ↓
4. Deploy and configure applications
   ↓
5. Update DNS (Route 53)
   ↓
6. Verify and resume service

Total time: Several hours to 24 hours
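
Because the steps run one after another, the achievable RTO is roughly the sum of the step durations. The sketch below adds them up and checks the total against a target. The durations are placeholders chosen for illustration, not measurements; real values should come from a DR drill.

```python
from datetime import timedelta

# Placeholder durations (assumptions for illustration only).
RUNBOOK = [
    ("Detect disaster and declare DR", timedelta(minutes=30)),
    ("Provision infrastructure with IaC", timedelta(hours=2)),
    ("Restore data from latest backup", timedelta(hours=4)),
    ("Deploy and configure applications", timedelta(hours=1)),
    ("Update DNS (Route 53)", timedelta(minutes=15)),
    ("Verify and resume service", timedelta(hours=1)),
]


def estimated_rto(steps):
    """Sum the durations of sequential recovery steps."""
    return sum((duration for _, duration in steps), timedelta())


def meets_target(steps, target_rto):
    """Return True if the sequential runbook fits within the target RTO."""
    return estimated_rto(steps) <= target_rto


print(estimated_rto(RUNBOOK))                        # total of the placeholder durations
print(meets_target(RUNBOOK, timedelta(hours=24)))
```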

Suitable Use Cases

  • Non-critical systems: Internal tools, dev environments
  • Cost priority: Limited budget
  • Relaxed RTO requirement: Can accept up to 24 hours of downtime
  • Compliance: Only data backup required

Exam Tip

Exam Point: Backup & Restore appears with keywords cost minimization + RTO 24 hours or less

2. Pilot Light

What is Pilot Light?

A strategy where only the core infrastructure runs, at minimal scale. The database stays synchronized through replication, while compute resources are started only during a disaster.

Pilot Light Architecture:

Normal Operation:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│ EC2 Running │               │ EC2 Stopped │
│ RDS Primary │ ─── Repl. ──→ │ RDS Replica │
│ (Read/Write)│   (Async)     │ (Read Only) │
└─────────────┘               └─────────────┘

During Disaster:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│     ✕       │               │ EC2 Start   │
│   Failed    │               │ RDS Promote │
│             │               │ (to Primary)│
└─────────────┘               └─────────────┘

Core Components

Component     | Normal State            | Disaster Action
Database      | Replica running         | Promote to Primary
EC2 Instances | Stopped or not created  | Start or create
AMI           | Keep updated            | Used to start EC2
Network       | VPC, subnets configured | Ready to use
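
The two disaster actions for the database and EC2 rows can be sketched as follows. This assumes `rds` and `ec2` are boto3 clients for the recovery region, and that the stopped instances already exist; the function name and identifiers are hypothetical.

```python
def fail_over_pilot_light(rds, ec2, replica_id, instance_ids):
    """Promote the standby database and start the stopped compute.

    `rds` and `ec2` are assumed to be boto3 clients bound to the
    recovery region.
    """
    # Turn the read replica into a standalone, writable primary.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Start the pre-created (stopped) application instances.
    instance_ids = list(instance_ids)
    ec2.start_instances(InstanceIds=instance_ids)

    return {"promoted": replica_id, "started": instance_ids}
```

Both calls return before the resources are ready, so a real runbook would also wait for the database and instances to become available before switching DNS.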

Difference from Backup & Restore

Key Differences:

Backup & Restore:
├── Data: Only backups stored in S3
├── Infrastructure: None (created during disaster)
└── RTO: 24 hours (infra creation + restore)

Pilot Light:
├── Data: DB replica synced in near real-time (async replication)
├── Infrastructure: Core only standby (DB, network)
└── RTO: Tens of minutes (EC2 start + DB promote)

Suitable Use Cases

  • Important business systems: Need recovery in tens of minutes
  • Cost/recovery balance: Cheaper than Warm Standby
  • Planned recovery procedures: Failover steps can be scripted and rehearsed in advance

Exam Tip

Exam Point: Pilot Light = only core systems running + RTO tens of minutes + DB replica maintained

3. Warm Standby

What is Warm Standby?

A strategy where a scaled-down version of the full environment runs in the recovery region. All components are running but at minimum capacity.

Warm Standby Architecture:

Normal Operation:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│ EC2: 10     │               │ EC2: 2      │
│ (Full cap)  │               │ (Min cap)   │
│             │               │             │
│ RDS Primary │ ─── Repl. ──→ │ RDS Replica │
│             │               │             │
│ 100% Traffic│               │ 0% Traffic  │
└─────────────┘               └─────────────┘

During Disaster:
Primary Region                 Recovery Region
┌─────────────┐               ┌─────────────┐
│     ✕       │               │ EC2: 10     │
│   Failed    │               │ (Scale up)  │
│             │               │             │
│             │               │ RDS Primary │
│             │               │ (Promoted)  │
│             │               │ 100% Traffic│
└─────────────┘               └─────────────┘

Pilot Light vs Warm Standby

Comparison      | Pilot Light         | Warm Standby
EC2 State       | Stopped/not created | Running (minimum)
Recovery Action | Start + deploy      | Scale up only
RTO             | Tens of minutes     | Minutes
Cost            | Low                 | Medium
Testing         | Complex             | Easy (already running)

Core Components

Component    | Normal                   | Disaster
Auto Scaling | Min capacity (e.g., 2)   | Desired capacity (e.g., 10)
RDS          | Read Replica             | Promote to Primary
ELB          | Active (min traffic)     | Full traffic
Route 53     | Weight 0%                | Weight 100%
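
The Auto Scaling and RDS rows translate into two API calls during a disaster. The sketch below assumes `autoscaling` and `rds` are boto3 clients for the recovery region and that the Auto Scaling group's maximum size already allows the full capacity; the function name and identifiers are hypothetical.

```python
def scale_up_warm_standby(autoscaling, rds, asg_name, replica_id, full_capacity):
    """Bring a scaled-down standby environment to production size."""
    # Promote the replica so the recovery region can accept writes.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Raise the desired capacity (e.g., from 2 to 10 instances).
    # Assumes the group's MaxSize is already >= full_capacity.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=full_capacity,
        HonorCooldown=False,  # do not wait for a cooldown period during a disaster
    )
    return full_capacity
```

Because the instances and load balancer are already running, nothing has to be created from scratch, which is why this path is faster than Pilot Light.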

Suitable Use Cases

  • Core business systems: Need recovery in minutes
  • Regular testing needed: Verify DR environment periodically
  • Budget available: Higher cost than Pilot Light acceptable

Exam Tip

Exam Point: Warm Standby = scaled-down environment running + only scale up needed + RTO minutes

4. Active-Active (Multi-Site)

What is Active-Active?

A strategy where multiple regions handle traffic simultaneously. When one region fails, others immediately handle the full load.

Active-Active Architecture:

Normal Operation:
      Route 53 (50% / 50% distribution)
              │
    ┌─────────┴─────────┐
    ↓                   ↓
Primary Region         Recovery Region
┌─────────────┐    ┌─────────────┐
│ EC2 (Full)  │    │ EC2 (Full)  │
│ 50% Traffic │    │ 50% Traffic │
│             │    │             │
│ DynamoDB    │←──→│ DynamoDB    │
│ Global Table│Repl│ Global Table│
└─────────────┘    └─────────────┘

During Disaster:
      Route 53 (0% / 100% distribution)
              │
    ┌─────────┴─────────┐
    ↓                   ↓
Primary Region         Recovery Region
┌─────────────┐    ┌─────────────┐
│     ✕       │    │ EC2 (Full)  │
│   Failed    │    │ 100% Traffic│
│             │    │             │
│             │    │ DynamoDB    │
│             │    │ (Continues) │
└─────────────┘    └─────────────┘

Core Components

Component                   | Description
Route 53                    | Latency/weighted-based routing
DynamoDB Global Tables      | Multi-region multi-master
Aurora Global Database      | Sub-second cross-region replication
S3 Cross-Region Replication | Bidirectional replication
Global Accelerator          | Global traffic distribution
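
The 50/50 to 0/100 shift in the diagrams corresponds to changing the weights on Route 53 weighted records. Here is a minimal sketch of that change, assuming `route53` is a boto3 Route 53 client; the hosted zone ID, record names, and set identifiers are hypothetical. In a real deployment, health checks attached to the records would normally remove the failed region automatically.

```python
def weighted_record_change(record_name, set_identifier, dns_name, weight):
    """Build one UPSERT for a weighted CNAME record (weight range 0-255)."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": set_identifier,  # distinguishes records sharing a name
            "Weight": weight,
            "TTL": 60,  # short TTL so clients pick up the change quickly
            "ResourceRecords": [{"Value": dns_name}],
        },
    }


def shift_traffic(route53, hosted_zone_id, record_name, targets):
    """Apply new weights. `targets` maps set identifier -> (dns_name, weight)."""
    changes = [
        weighted_record_change(record_name, set_id, dns_name, weight)
        for set_id, (dns_name, weight) in targets.items()
    ]
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={"Comment": "DR traffic shift", "Changes": changes},
    )
    return changes
```

Calling `shift_traffic` with weights `0` for the failed region and `100` for the surviving one reproduces the "During Disaster" distribution.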

Data Synchronization Considerations

Active-Active Data Sync Challenges:

1. Write Conflicts
   - Same record modified in two regions simultaneously
   - Solution: Last Writer Wins, record versioning

2. Consistency
   - Replication delay between regions
   - Solution: Accept eventual consistency, pin read region

3. Transactions
   - Distributed transactions difficult
   - Solution: Use transactions within region only

Suitable Use Cases

  • Mission-critical systems: Zero downtime tolerance
  • Global services: Serve users from nearest region
  • Regulatory requirements: Zero RTO/RPO mandatory
  • Sufficient budget: 2x infrastructure cost

Exam Tip

Exam Point: Active-Active = RTO/RPO ~0 + both sides handle traffic + data conflict management needed

Strategy Selection Guide

Selection by Requirements

DR Strategy Decision Tree:

RTO requirement?
    │
    ├── Up to 24 hours acceptable → Backup & Restore
    │
    ├── Tens of minutes needed → Pilot Light
    │
    ├── Minutes needed → Warm Standby
    │
    └── Zero downtime → Active-Active
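
The decision tree can be written as a small function. The numeric thresholds (10 minutes for "tens of minutes", 1 minute for "minutes") are assumptions made for this sketch, since the tree itself gives only qualitative bands.

```python
from datetime import timedelta


def choose_strategy(rto_target):
    """Map an RTO requirement to the cheapest strategy band that fits it."""
    if rto_target >= timedelta(hours=24):
        return "Backup & Restore"
    if rto_target >= timedelta(minutes=10):  # assumed cutoff for "tens of minutes"
        return "Pilot Light"
    if rto_target >= timedelta(minutes=1):   # assumed cutoff for "minutes"
        return "Warm Standby"
    return "Active-Active"


print(choose_strategy(timedelta(hours=1)))    # Pilot Light
print(choose_strategy(timedelta(minutes=5)))  # Warm Standby
```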

Cost vs Recovery Time Analysis

Monthly Cost Example (base environment $10,000/month):

Backup & Restore:
├── Additional cost: ~$500/month (S3 backup + snapshots)
└── RTO: 24 hours

Pilot Light:
├── Additional cost: ~$2,000/month (DB replica + min infra)
└── RTO: 30 minutes

Warm Standby:
├── Additional cost: ~$5,000/month (scaled-down always running)
└── RTO: 5 minutes

Active-Active:
├── Additional cost: ~$10,000/month (full environment 2x)
└── RTO: ~0
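
Using the example figures above, the following sketch expresses each strategy's extra cost as a percentage of the base environment and picks the cheapest option that meets a given RTO. The numbers are the article's illustrative figures, not AWS pricing.

```python
BASE_MONTHLY_COST = 10_000  # USD, base environment from the example

# (strategy, additional monthly cost in USD, RTO in minutes)
STRATEGIES = [
    ("Backup & Restore", 500, 24 * 60),
    ("Pilot Light", 2_000, 30),
    ("Warm Standby", 5_000, 5),
    ("Active-Active", 10_000, 0),
]


def overhead_percent(additional_cost):
    """Additional DR cost as a percentage of the base environment."""
    return additional_cost * 100 / BASE_MONTHLY_COST


def cheapest_meeting_rto(max_rto_minutes):
    """Return the lowest-cost strategy whose RTO fits the requirement."""
    candidates = [s for s in STRATEGIES if s[2] <= max_rto_minutes]
    return min(candidates, key=lambda s: s[1])[0]


for name, extra, rto in STRATEGIES:
    print(f"{name}: +{overhead_percent(extra):.0f}% cost, RTO {rto} min")
print(cheapest_meeting_rto(60))
```

With a 60-minute RTO requirement the function returns Pilot Light, since Backup & Restore is cheaper but too slow.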

AWS Services for DR

AWS Services by Strategy

Strategy         | Compute       | Database                       | Storage | Networking
Backup & Restore | AMI           | RDS Snapshots                  | S3 CRR  | -
Pilot Light      | AMI (standby) | RDS Read Replica               | S3 CRR  | VPC configured
Warm Standby     | ASG (min)     | RDS Read Replica               | S3 CRR  | ELB
Active-Active    | ASG (full)    | Aurora Global, DynamoDB Global | S3 CRR  | Route 53

AWS Elastic Disaster Recovery

AWS Elastic Disaster Recovery (DRS):

Features:
├── Continuous block-level replication
├── RPO of seconds
├── Minutes RTO
├── Pilot Light cost for Warm Standby-level recovery
└── Automated failover

Suitable for:
├── On-premises → AWS DR
├── AWS region-to-region DR
└── Cost-effective DR needed
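
For readers who want to see what triggering a DRS recovery looks like, here is a hedged sketch of launching a non-disruptive drill. It assumes `drs` is a boto3 client for the `drs` service and uses the `start_recovery` call with `isDrill` and `sourceServers` as I understand them from the SDK; check the parameter names against the current boto3 documentation before relying on them. The function name and server IDs are hypothetical.

```python
def start_dr_drill(drs, source_server_ids):
    """Launch drill instances for the given DRS source servers.

    `drs` is assumed to be a boto3 client created with
    boto3.client("drs", region_name=<recovery region>).
    """
    response = drs.start_recovery(
        isDrill=True,  # drill launch, not an actual recovery
        sourceServers=[{"sourceServerID": sid} for sid in source_server_ids],
    )
    # The call starts an asynchronous job; return its ID for tracking.
    return response["job"]["jobID"]
```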

SAA-C03 Exam Focus Points

  1. RTO/RPO by strategy: Backup (hours), Pilot Light (tens of min), Warm Standby (min), Active-Active (~0)
  2. Cost order: Backup < Pilot Light < Warm Standby < Active-Active
  3. Pilot Light vs Warm Standby: Core only running vs scaled-down full environment
  4. Active-Active data: DynamoDB Global Tables, Aurora Global Database
  5. Route 53 role: Failover routing, health checks
  6. AWS DRS: Pilot Light cost for low RTO

Exam Tip

Sample Exam Question: "A company has DR requirements of an RTO of 1 hour and an RPO of 15 minutes. Which DR strategy minimizes cost while meeting these requirements?" → Answer: Pilot Light (RTO of tens of minutes, RPO of minutes, cheaper than Warm Standby)

Frequently Asked Questions

Q: What's the biggest difference between Pilot Light and Warm Standby?

The biggest difference is whether EC2 is running. Pilot Light runs only the database, with EC2 stopped or not yet created. Warm Standby runs all components at minimum capacity. As a result, Warm Standby has a shorter RTO but a higher cost.

Q: How are data conflicts resolved in Active-Active?

Either Last Writer Wins or custom conflict resolution logic is needed. DynamoDB Global Tables use timestamp-based last-writer-wins resolution. At the application level, another approach is to route writes that could conflict to a single region.

Q: How do you test DR strategies?

Conduct regular DR drills. Simulate actual failover and verify RTO/RPO goals are met. Warm Standby and Active-Active are easier to test as they're already running.

Q: How can RTO be reduced in Backup & Restore?

Use automation and IaC. Codify infrastructure with CloudFormation/Terraform, configure automated backups with AWS Backup, and automate recovery procedures to reduce RTO to several hours.

Q: Are hybrid DR strategies possible?

Yes, different strategies can be applied per workload. Core systems can use Warm Standby while secondary systems use Backup & Restore to balance cost and recovery time.

Q: What about data transfer costs for multi-region DR?

Cross-region data transfer incurs costs. S3 CRR, RDS cross-region replication, and DynamoDB Global Tables all charge for inter-region data transfer. Active-Active has the highest transfer costs because data is synchronized in both directions.


