Multi-AZ vs Multi-Region: AWS High Availability and Disaster Recovery Guide
Compare Multi-AZ and Multi-Region architectures on AWS. Learn when to use Pilot Light, Warm Standby, or Active-Active DR strategies based on RTO/RPO requirements.
Related Exam Domains
- Domain 2: Design Resilient Architectures
Key Takeaway
Multi-AZ is for High Availability (HA); Multi-Region is for Disaster Recovery (DR). Use Multi-AZ to survive AZ failures; use Multi-Region to survive regional failures or to meet geographic requirements.
Quick Comparison
| Aspect | Multi-AZ | Multi-Region |
|---|---|---|
| Purpose | High Availability | Disaster Recovery |
| Protection | AZ failure | Region failure, natural disasters |
| Failover Time | Automatic, 1-2 min | Manual/Auto, minutes to hours |
| Replication | Synchronous | Asynchronous |
| Latency | Milliseconds | Tens to hundreds of ms |
| Cost | ~2x baseline | 2x+ baseline |
| Complexity | Low | High |
Exam Tip
Exam Essential: "AZ failure protection" = Multi-AZ. "Region failure" or "natural disaster recovery" = Multi-Region. Consider cost and complexity trade-offs.
Multi-AZ Architecture
What is Multi-AZ?
Multi-AZ distributes resources across multiple Availability Zones within a single Region.
┌─────────────────────────────────────────────────┐
│                  Seoul Region                   │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │    AZ-a     │ │    AZ-b     │ │    AZ-c     │ │
│ │  ┌───────┐  │ │  ┌───────┐  │ │  ┌───────┐  │ │
│ │  │  EC2  │  │ │  │  EC2  │  │ │  │  EC2  │  │ │
│ │  └───────┘  │ │  └───────┘  │ │  └───────┘  │ │
│ │  ┌───────┐  │ │  ┌───────┐  │ │             │ │
│ │  │  RDS  │←─┼─┼─→│Standby│  │ │             │ │
│ │  │Primary│  │ │  │  RDS  │  │ │             │ │
│ │  └───────┘  │ │  └───────┘  │ │             │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────┘
Multi-AZ Features
| Feature | Description |
|---|---|
| Synchronous replication | Real-time data sync between Primary and Standby |
| Automatic failover | Automatic switch to Standby on failure detection |
| Single endpoint | DNS name unchanged, only IP changes |
| Same region | Low latency, simple architecture |
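Because the endpoint name stays the same and only the address behind it changes, a client usually just needs to reconnect by name after a failover. The following is a minimal sketch of that idea, not a prescribed AWS client pattern: `connect_with_retry`, the `connect` callable, and the endpoint `mydb.example.internal` are all hypothetical names introduced for illustration.

```python
import time

def connect_with_retry(connect, endpoint, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect(endpoint) until it succeeds, backing off between attempts.

    The DNS name is passed on every attempt (never a cached IP), so a later
    attempt can reach the promoted standby once the record has changed.
    """
    for attempt in range(attempts):
        try:
            return connect(endpoint)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulated failover: the first two attempts fail, the third succeeds.
calls = []

def fake_connect(endpoint):
    calls.append(endpoint)
    if len(calls) < 3:
        raise ConnectionError("primary unavailable")
    return f"connected to {endpoint}"

delays = []  # record the back-off delays instead of really sleeping
result = connect_with_retry(fake_connect, "mydb.example.internal", sleep=delays.append)
print(result, delays)
```

In a real application the `connect` callable would be the database driver's connect function; the back-off values here are arbitrary.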
Multi-AZ Supported Services
- ✅ RDS (automatic failover)
- ✅ ElastiCache for Redis (Multi-AZ with automatic failover to a replica)
- ✅ EFS (Regional file systems span multiple AZs by default)
- ✅ Aurora (storage replicated across three AZs by default)
- ✅ OpenSearch (Multi-AZ deployment)
- ✅ MSK (Multi-AZ recommended)
Multi-Region Architecture
What is Multi-Region?
Multi-Region distributes infrastructure across multiple AWS Regions.
┌──────────────────────┐         ┌──────────────────────┐
│     Seoul Region     │         │     Tokyo Region     │
│   (ap-northeast-2)   │         │   (ap-northeast-1)   │
│  ┌────────────────┐  │         │  ┌────────────────┐  │
│  │   Primary DB   │──┼─ Repl ──┼─→│   Replica DB   │  │
│  │    (Aurora)    │  │         │  │    (Aurora)    │  │
│  └────────────────┘  │         │  └────────────────┘  │
│  ┌────────────────┐  │         │  ┌────────────────┐  │
│  │  Application   │  │         │  │  Application   │  │
│  │    Servers     │  │         │  │   (Standby)    │  │
│  └────────────────┘  │         │  └────────────────┘  │
└──────────────────────┘         └──────────────────────┘
            │                               │
            └───────────────┬───────────────┘
                            │
                     ┌──────┴──────┐
                     │  Route 53   │
                     │ (Failover)  │
                     └─────────────┘
Multi-Region Supported Services
- ✅ Aurora Global Database (cross-Region replication lag typically under 1 second)
- ✅ DynamoDB Global Tables (multi-Region writes; replication typically completes within about a second)
- ✅ S3 Cross-Region Replication (CRR)
- ✅ Route 53 (Global DNS)
- ✅ CloudFront (Global CDN)
- ✅ Global Accelerator (Global network)
DR Strategy Comparison
Understanding RTO and RPO
| Metric | Definition | Question |
|---|---|---|
| RPO | Recovery Point Objective | "How much data can we afford to lose?" |
| RTO | Recovery Time Objective | "How quickly must we recover?" |
        Disaster Occurs
               │
               ▼
──────┬────────┼────────┬────────────────┬────→ Time
      │        │        │                │
      │◄──────►│        │◄──────────────►│
      │  RPO   │        │      RTO       │
      │        │        │                │
    Last   Disaster Recovery          Service
   Backup            Starts          Restored
Exam Tip
Exam Tip: If RPO must be zero, you need synchronous replication, because asynchronous replication always leaves some lag. If RTO must be near zero, you need a Hot Standby (fully scaled and ready to take traffic) or Active-Active.
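As a worked illustration of the two metrics, the sketch below computes the data-loss window and the downtime for one incident and checks them against targets. The timestamps, the targets, and the `measure_recovery` helper are invented for the example, and downtime is assumed to be counted from the moment the disaster occurs.

```python
from datetime import datetime, timedelta

def measure_recovery(last_backup, disaster, service_restored):
    """Return (data_loss_window, downtime) for one incident."""
    data_loss_window = disaster - last_backup   # compared against the RPO
    downtime = service_restored - disaster      # compared against the RTO
    return data_loss_window, downtime

# Hypothetical incident timeline
last_backup = datetime(2024, 1, 1, 2, 0)
disaster = datetime(2024, 1, 1, 5, 30)
service_restored = datetime(2024, 1, 1, 9, 30)

loss, downtime = measure_recovery(last_backup, disaster, service_restored)

# Hypothetical business targets
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=2)

print("Data loss window:", loss, "| meets RPO:", loss <= rpo_target)
print("Downtime:", downtime, "| meets RTO:", downtime <= rto_target)
```

In this made-up incident, 3.5 hours of data are lost (within the 4-hour RPO) but the service is down for 4 hours (missing the 2-hour RTO), so the backup cadence is adequate while the recovery procedure is not.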
The Four DR Strategies
1. Backup and Restore
The most basic DR strategy: periodically back up data and restore it after a disaster.
| Metric | Value |
|---|---|
| RTO | Hours to 24 hours |
| RPO | Hours (since last backup) |
| Cost | 💰 (Lowest) |
| Complexity | ⭐ (Simplest) |
Best for: Non-critical systems, dev/test environments, cost-sensitive workloads
2. Pilot Light
Keep only the core systems (typically the data layer) running at minimal capacity in the DR region; provision everything else when a disaster occurs.
| Metric | Value |
|---|---|
| RTO | Minutes to hours |
| RPO | Minutes (async replication lag) |
| Cost | 💰💰 (Low-Medium) |
| Complexity | ⭐⭐ (Medium) |
Best for: Core business systems, balanced cost vs recovery time
3. Warm Standby
Run a scaled-down version of the full system in the DR region.
| Metric | Value |
|---|---|
| RTO | Minutes |
| RPO | Seconds to minutes |
| Cost | 💰💰💰 (Medium-High) |
| Complexity | ⭐⭐⭐ (High) |
Best for: Critical systems requiring fast recovery, some downtime acceptable
4. Multi-Site Active-Active
Both regions actively handle traffic simultaneously.
| Metric | Value |
|---|---|
| RTO | Near-zero |
| RPO | Near-zero |
| Cost | 💰💰💰💰 (Highest) |
| Complexity | ⭐⭐⭐⭐ (Highest) |
Best for: Mission-critical systems, zero downtime tolerance, global users
DR Strategy Summary
| Strategy | RTO | RPO | Cost | Automation |
|---|---|---|---|---|
| Backup/Restore | Hours (up to 24h) | Hours | $ | Manual |
| Pilot Light | Minutes to hours | Minutes | $$ | Semi-auto |
| Warm Standby | Minutes | Seconds to minutes | $$$ | Auto |
| Active-Active | ~0 | ~0 | $$$$ | Fully auto |
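Matching requirements to a strategy amounts to picking the cheapest option whose recovery bounds still fit the targets. The sketch below shows that selection logic; the numeric worst-case bounds are illustrative assumptions chosen to fit the ranges in the table, not AWS guarantees, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# (name, assumed worst-case RTO, assumed worst-case RPO), cheapest first.
# These numbers are assumptions for illustration only.
STRATEGIES = [
    ("Backup and Restore", timedelta(hours=24), timedelta(hours=12)),
    ("Pilot Light", timedelta(hours=4), timedelta(minutes=15)),
    ("Warm Standby", timedelta(minutes=30), timedelta(minutes=1)),
    ("Multi-Site Active-Active", timedelta(minutes=1), timedelta(seconds=5)),
]

def cheapest_strategy(rto_target, rpo_target):
    """Return the first (cheapest) strategy whose assumed bounds fit both targets."""
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_target and rpo <= rpo_target:
            return name
    return None  # targets are tighter than anything modelled here

strict = cheapest_strategy(timedelta(hours=1), timedelta(minutes=5))
relaxed = cheapest_strategy(timedelta(hours=48), timedelta(hours=24))
print(strict)
print(relaxed)
```

With these assumed bounds, a 1-hour RTO and 5-minute RPO select Warm Standby, while very relaxed targets fall back to Backup and Restore. Real bounds depend on the workload and should come from DR tests.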
AWS Services by DR Strategy
| Strategy | Database | Compute | Networking |
|---|---|---|---|
| Backup/Restore | RDS Snapshots + S3 CRR | AMI Copy | Route 53 manual |
| Pilot Light | Aurora Global (replica only) | EC2 AMIs ready (instances not running) | Route 53 Failover |
| Warm Standby | Aurora Global (scaled down) | EC2 minimal running | Route 53 Failover |
| Active-Active | DynamoDB Global Tables | Full scale both sides | Route 53 Latency/Weighted |
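For the Route 53 Failover entries in the table, the sketch below builds a primary/secondary pair of failover records as plain dictionaries. The field names follow the Route 53 `ResourceRecordSet` structure; the domain names, the health check ID, and the two helper functions are placeholders invented for the example, and nothing here calls AWS.

```python
def failover_record(name, target, role, health_check_id=None, ttl=60):
    """Build one Route 53 failover record set; role is 'PRIMARY' or 'SECONDARY'."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": f"{name}-{role.lower()}",  # must be unique per record
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the change sooner
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record

def failover_change_batch(name, primary_target, secondary_target, primary_health_check_id):
    """Build a change batch that upserts the primary and secondary records."""
    records = [
        failover_record(name, primary_target, "PRIMARY", primary_health_check_id),
        failover_record(name, secondary_target, "SECONDARY"),
    ]
    return {"Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records]}

# Placeholder names: Seoul is primary, Tokyo is the DR target.
batch = failover_change_batch(
    "app.example.com",
    "seoul-lb.example.com",
    "tokyo-lb.example.com",
    "hc-primary-placeholder",
)
for change in batch["Changes"]:
    rrs = change["ResourceRecordSet"]
    print(rrs["Failover"], "->", rrs["ResourceRecords"][0]["Value"])
```

A dictionary of this shape is what boto3's `change_resource_record_sets` accepts as its `ChangeBatch` argument. For Active-Active, the same idea applies with latency or weighted routing fields in place of `Failover`.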
Exam Tip
Exam Essential: When given RTO/RPO requirements, match them to the appropriate DR strategy. "Minimize downtime" = Active-Active. "Minimize cost" = Backup & Restore.
SAA-C03 Exam Focus Points
- ✅ RTO/RPO-based DR selection: Match requirements to strategy
- ✅ Cost vs Availability: Lower RTO/RPO = Higher cost
- ✅ Service capabilities: RDS Multi-AZ = sync replication; Aurora Global = async
- ✅ Route 53 role: Health Check + Failover for DR; Latency routing for Active-Active
Key Memorization
| Keyword | Association |
|---|---|
| AZ failure protection | Multi-AZ |
| Region failure protection | Multi-Region |
| Synchronous replication | Multi-AZ, RPO=0 |
| Asynchronous replication | Multi-Region, RPO>0 |
| Lowest cost DR | Backup and Restore |
| Lowest RTO | Active-Active |
| Core only running | Pilot Light |
| Scaled-down operation | Warm Standby |
Exam Tip
Sample Exam Question: "A company requires RPO of 5 minutes and RTO of 1 hour for their mission-critical application. Which DR strategy should they implement?" → Answer: Warm Standby (Provides minutes-level RPO with async replication and minutes-level RTO with pre-running infrastructure)
Frequently Asked Questions
Q: Is Multi-AZ enough for disaster recovery?
Multi-AZ protects against AZ failures but not regional failures (natural disasters, widespread outages). For business continuity, consider Multi-Region.
Q: How do you handle data conflicts in Active-Active?
DynamoDB Global Tables uses "last writer wins" conflict resolution. For finer control, implement application-level conflict resolution logic.
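To make the trade-off concrete, here is a conceptual sketch of "last writer wins" next to one possible application-level merge. It models the idea only and is not how DynamoDB implements it internally; the item layout, the `updated_at` field, and both function names are assumptions for illustration.

```python
def last_writer_wins(local, remote):
    """Keep whichever version carries the later timestamp (ties keep local)."""
    return remote if remote["updated_at"] > local["updated_at"] else local

def merge_tags(local, remote):
    """Application-level alternative: union a set attribute so neither write is lost."""
    merged = dict(last_writer_wins(local, remote))
    merged["tags"] = local["tags"] | remote["tags"]
    return merged

# The same item written in two Regions within the same second.
seoul = {"item_id": "cart-1", "qty": 2, "tags": {"gift"}, "updated_at": 1700000000.120}
tokyo = {"item_id": "cart-1", "qty": 5, "tags": {"express"}, "updated_at": 1700000000.450}

winner = last_writer_wins(seoul, tokyo)
merged = merge_tags(seoul, tokyo)
print(winner["qty"], sorted(winner["tags"]))
print(merged["qty"], sorted(merged["tags"]))
```

Under last writer wins, the Seoul write disappears entirely, including its `gift` tag. The merge keeps the newer scalar values but preserves both tags, which is the kind of rule an application has to define per attribute.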
Q: How often should you test DR?
Test DR on a regular schedule, for example through Game Days. Quarterly is a common minimum, and critical systems are often tested monthly. Document the results of every test.
Q: What's the practical difference between Pilot Light and Warm Standby?
- Pilot Light: Only database running, app servers stopped
- Warm Standby: Full stack running but scaled down
Warm Standby provides faster recovery but costs more.
Q: Aurora Global Database vs DynamoDB Global Tables?
- Aurora Global: Relational data, SQL needed, ACID transactions
- DynamoDB Global: NoSQL, replication typically within about a second, Active-Active writes