Multi-AZ vs Multi-Region: AWS High Availability and Disaster Recovery Guide

Key Takeaway

Multi-AZ is for High Availability (HA), Multi-Region is for Disaster Recovery (DR). Use Multi-AZ for AZ failures; use Multi-Region for regional failures or geographic requirements.

Quick Comparison

Aspect	Multi-AZ	Multi-Region
Purpose	High Availability	Disaster Recovery
Protection	AZ failure	Region failure, natural disasters
Failover Time	Automatic, 1-2 min	Manual/Auto, minutes to hours
Replication	Synchronous	Asynchronous
Latency	Milliseconds	Tens to hundreds of ms
Cost	~2x baseline	2x+ baseline
Complexity	Low	High

Exam Tip

Exam Essential: "AZ failure protection" = Multi-AZ. "Region failure" or "natural disaster recovery" = Multi-Region. Consider cost and complexity trade-offs.

Multi-AZ Architecture

What is Multi-AZ?

Multi-AZ distributes resources across multiple Availability Zones within a single Region.

┌─────────────────────────────────────────────────────┐
│                    Seoul Region                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │    AZ-a     │  │    AZ-b     │  │    AZ-c     │  │
│  │  ┌───────┐  │  │  ┌───────┐  │  │  ┌───────┐  │  │
│  │  │  EC2  │  │  │  │  EC2  │  │  │  │  EC2  │  │  │
│  │  └───────┘  │  │  └───────┘  │  │  └───────┘  │  │
│  │  ┌───────┐  │  │  ┌───────┐  │  │             │  │
│  │  │  RDS  │←─┼──┼─→│Standby│  │  │             │  │
│  │  │Primary│  │  │  │  RDS  │  │  │             │  │
│  │  └───────┘  │  │  └───────┘  │  │             │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────┘

Multi-AZ Features

Feature	Description
Synchronous replication	Real-time data sync between Primary and Standby
Automatic failover	Automatic switch to Standby on failure detection
Single endpoint	DNS name unchanged, only IP changes
Same region	Low latency, simple architecture
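
As a minimal sketch of how Multi-AZ is requested for RDS, the snippet below builds the keyword arguments for boto3's `create_db_instance` call, where `MultiAZ=True` asks RDS to provision a standby in another AZ. The helper name `multi_az_db_params`, the identifier, the instance class, and the storage size are hypothetical placeholders, and the AWS call itself is left as a comment so the example runs without credentials.

```python
def multi_az_db_params(identifier: str, engine: str = "postgres") -> dict:
    """Build keyword arguments for boto3's rds.create_db_instance.

    All concrete values below are illustrative placeholders.
    """
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": engine,
        "DBInstanceClass": "db.m6g.large",  # hypothetical instance size
        "AllocatedStorage": 100,            # GiB, hypothetical
        "MasterUsername": "dbadmin",        # hypothetical
        "ManageMasterUserPassword": True,   # let RDS manage the password in Secrets Manager
        "MultiAZ": True,                    # provision a synchronous standby in another AZ
    }


params = multi_az_db_params("orders-db")
print(params["MultiAZ"])

# With AWS credentials configured, the call would look like:
#   import boto3
#   boto3.client("rds", region_name="ap-northeast-2").create_db_instance(**params)
```

Because the endpoint DNS name stays the same after a failover, application connection strings should not need to change; clients only have to reconnect.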

Multi-AZ Supported Services

✅ RDS (automatic failover)
✅ ElastiCache for Redis (Multi-AZ with automatic failover)
✅ EFS (Multi-AZ by default for Regional file systems)
✅ Aurora (storage replicated across three AZs by default)
✅ OpenSearch (Multi-AZ deployment)
✅ MSK (Multi-AZ recommended)

Multi-Region Architecture

What is Multi-Region?

Multi-Region distributes infrastructure across multiple AWS Regions.

┌──────────────────────┐         ┌──────────────────────┐
│    Seoul Region      │         │    Tokyo Region      │
│   (ap-northeast-2)   │         │   (ap-northeast-1)   │
│  ┌────────────────┐  │         │  ┌────────────────┐  │
│  │   Primary DB   │──┼─ Repl ──┼─→│   Replica DB   │  │
│  │   (Aurora)     │  │         │  │   (Aurora)     │  │
│  └────────────────┘  │         │  └────────────────┘  │
│  ┌────────────────┐  │         │  ┌────────────────┐  │
│  │  Application   │  │         │  │  Application   │  │
│  │    Servers     │  │         │  │   (Standby)    │  │
│  └────────────────┘  │         │  └────────────────┘  │
└──────────────────────┘         └──────────────────────┘
              │                            │
              └──────────┬─────────────────┘
                         │
                  ┌──────┴──────┐
                  │  Route 53   │
                  │  (Failover) │
                  └─────────────┘
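
The Route 53 failover routing in the diagram can be sketched as a pair of records: a PRIMARY record guarded by a health check and a SECONDARY record that is served when the primary is unhealthy. The snippet builds the `ChangeBatch` structure accepted by boto3's `change_resource_record_sets`. The function name, domain, IP addresses, and health check ID are hypothetical, and the AWS call is left as a comment.

```python
def failover_change_batch(name: str, primary_ip: str, secondary_ip: str,
                          health_check_id: str) -> dict:
    """Build a ChangeBatch with PRIMARY and SECONDARY failover records."""

    def record(role: str, ip: str) -> dict:
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-endpoint",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": 60,         # short TTL so clients pick up a failover quickly
            "ResourceRecords": [{"Value": ip}],
        }
        if role == "PRIMARY":
            # Route 53 answers with the secondary when this health check fails.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {"Changes": [record("PRIMARY", primary_ip),
                        record("SECONDARY", secondary_ip)]}


batch = failover_change_batch(
    "app.example.com.",             # hypothetical domain
    "203.0.113.10",                 # hypothetical Seoul endpoint
    "198.51.100.20",                # hypothetical Tokyo endpoint
    "hypothetical-health-check-id",
)
roles = [c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]]
print(roles)

# With credentials and a real hosted zone, the call would look like:
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```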

Multi-Region Supported Services

✅ Aurora Global Database (under 1 second replication)
✅ DynamoDB Global Tables (typically sub-second replication)
✅ S3 Cross-Region Replication (CRR)
✅ Route 53 (Global DNS)
✅ CloudFront (Global CDN)
✅ Global Accelerator (Global network)

DR Strategy Comparison

Understanding RTO and RPO

Metric	Definition	Question
RPO	Recovery Point Objective	"How much data can we afford to lose?"
RTO	Recovery Time Objective	"How quickly must we recover?"

        Disaster Occurs
            │
            ▼
──────┬─────┼─────┬──────────────────→ Time
      │     │     │
      │◄───►│     │◄──────────────►│
      │ RPO │     │      RTO       │
      │     │     │                │
   Last   Disaster  Recovery     Service
   Backup            Starts      Restored
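
To make the timeline concrete, here is a small worked example that computes the data-loss window and the downtime actually experienced in an incident, then compares them with the objectives. The timestamps and targets are invented for illustration, and this sketch assumes downtime is measured from the moment the disaster occurs.

```python
from datetime import datetime, timedelta


def achieved_rpo_rto(last_backup: datetime, disaster: datetime,
                     restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (data-loss window, downtime) for one incident."""
    data_loss_window = disaster - last_backup  # work done since the last backup is lost
    downtime = restored - disaster             # time users were without the service
    return data_loss_window, downtime


# Hypothetical incident
last_backup = datetime(2024, 1, 1, 2, 0)
disaster = datetime(2024, 1, 1, 9, 30)
restored = datetime(2024, 1, 1, 13, 0)

loss, downtime = achieved_rpo_rto(last_backup, disaster, restored)
print(loss, downtime)  # 7:30:00 3:30:00

# Hypothetical objectives: RPO 4 hours, RTO 4 hours
rpo_met = loss <= timedelta(hours=4)
rto_met = downtime <= timedelta(hours=4)
print(rpo_met, rto_met)  # False True
```

In this example the RTO is met but the RPO is not: a nightly backup cannot satisfy a 4-hour RPO, so backups would need to run more often.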

Exam Tip

Exam Tip: If RPO must be exactly "0", you need synchronous replication; asynchronous replication can only bring RPO close to zero. If RTO must be near "0", you need Hot Standby or Active-Active.
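
The link between replication mode and RPO can be illustrated with a toy model. It is not how RDS or Aurora replicate internally: the class name `ReplicatedStore` is hypothetical, and replication lag is counted in writes rather than in time to keep the sketch deterministic.

```python
class ReplicatedStore:
    """Toy primary/replica pair for comparing sync and async replication."""

    def __init__(self, synchronous: bool, lag: int = 0):
        self.synchronous = synchronous
        self.lag = lag  # async only: newest writes not yet shipped to the replica
        self.primary: list[str] = []
        self.replica: list[str] = []

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # The write is acknowledged only after the replica has it too.
            self.replica.append(record)
        else:
            # The replica trails the primary by up to `lag` writes.
            shipped = max(len(self.primary) - self.lag, 0)
            self.replica = self.primary[:shipped]

    def lost_on_failover(self) -> list[str]:
        """Writes that exist only on the primary and vanish if it fails now."""
        return self.primary[len(self.replica):]


sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False, lag=2)
for i in range(5):
    sync_store.write(f"order-{i}")
    async_store.write(f"order-{i}")

print(sync_store.lost_on_failover())   # []
print(async_store.lost_on_failover())  # ['order-3', 'order-4']
```

The synchronous store loses nothing on failover (RPO = 0), while the asynchronous store loses whatever was still in flight (RPO > 0).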

The Four DR Strategies

1. Backup and Restore

The most basic DR strategy: periodically back up data and restore it after a disaster.

Metric	Value
RTO	Hours to 24 hours
RPO	Hours (since last backup)
Cost	💰 (Lowest)
Complexity	⭐ (Simplest)

Best for: Non-critical systems, dev/test environments, cost-sensitive workloads

2. Pilot Light

Keep only the core systems (typically the database) running at minimal capacity in the DR Region; provision the rest when a disaster occurs.

Metric	Value
RTO	Minutes to hours
RPO	Minutes (async replication lag)
Cost	💰💰 (Low-Medium)
Complexity	⭐⭐ (Medium)

Best for: Core business systems, balanced cost vs recovery time

3. Warm Standby

Run a scaled-down but fully functional copy of the system in the DR Region, and scale it up on failover.

Metric	Value
RTO	Minutes
RPO	Seconds to minutes
Cost	💰💰💰 (Medium-High)
Complexity	⭐⭐⭐ (High)

Best for: Critical systems requiring fast recovery, some downtime acceptable

4. Multi-Site Active-Active

Both regions actively handle traffic simultaneously.

Metric	Value
RTO	Near-zero
RPO	Near-zero
Cost	💰💰💰💰 (Highest)
Complexity	⭐⭐⭐⭐ (Highest)

Best for: Mission-critical systems, zero downtime tolerance, global users

DR Strategy Summary

Strategy	RTO	RPO	Cost	Automation
Backup/Restore	Hours (up to 24h)	Hours	$	Manual
Pilot Light	Minutes to hours	Minutes	$$	Semi-auto
Warm Standby	Minutes	Seconds to minutes	$$$	Auto
Active-Active	~0	~0	$$$$	Fully auto
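
Matching requirements to a strategy amounts to picking the cheapest option whose worst-case RTO and RPO both fit. The sketch below encodes that rule. The numeric ceilings are assumptions chosen to turn the qualitative ranges in the table into comparable numbers; they are not AWS guarantees, and the names `Strategy` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data-loss window


# Ordered from cheapest to most expensive; all numbers are illustrative.
STRATEGIES = [
    Strategy("Backup and Restore", rto_minutes=24 * 60, rpo_minutes=12 * 60),
    Strategy("Pilot Light", rto_minutes=4 * 60, rpo_minutes=15),
    Strategy("Warm Standby", rto_minutes=30, rpo_minutes=5),
    Strategy("Multi-Site Active-Active", rto_minutes=0, rpo_minutes=0),  # "near-zero", treated as 0
]


def cheapest_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Return the cheapest strategy whose assumed RTO and RPO fit the targets."""
    if rto_minutes < 0 or rpo_minutes < 0:
        raise ValueError("objectives must be non-negative")
    for strategy in STRATEGIES:
        if strategy.rto_minutes <= rto_minutes and strategy.rpo_minutes <= rpo_minutes:
            return strategy.name
    return STRATEGIES[-1].name  # unreachable for non-negative inputs; kept as a safe default


print(cheapest_strategy(rto_minutes=60, rpo_minutes=5))        # Warm Standby
print(cheapest_strategy(rto_minutes=48 * 60, rpo_minutes=24 * 60))  # Backup and Restore
print(cheapest_strategy(rto_minutes=1, rpo_minutes=0))         # Multi-Site Active-Active
```

Iterating from cheapest to most expensive reflects the trade-off in the table: every step down buys a lower RTO and RPO at a higher cost.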

AWS Services by DR Strategy

Strategy	Database	Compute	Networking
Backup/Restore	RDS Snapshots + S3 CRR	AMI Copy	Route 53 manual
Pilot Light	Aurora Global (replica only)	EC2 AMIs (stopped)	Route 53 Failover
Warm Standby	Aurora Global (scaled down)	EC2 minimal running	Route 53 Failover
Active-Active	DynamoDB Global Tables	Full scale both sides	Route 53 Latency/Weighted

SAA-C03 Exam Focus Points

Exam Tip

Exam Essential: When given RTO/RPO requirements, match them to the appropriate DR strategy. "Minimize downtime" = Active-Active. "Minimize cost" = Backup & Restore.

Key Memorization

Keyword	Association
AZ failure protection	Multi-AZ
Region failure protection	Multi-Region
Synchronous replication	Multi-AZ, RPO=0
Asynchronous replication	Multi-Region, RPO>0
Lowest cost DR	Backup and Restore
Lowest RTO	Active-Active
Core only running	Pilot Light
Scaled-down operation	Warm Standby

Sample Exam Question: "A company requires an RPO of 5 minutes and an RTO of 1 hour for its mission-critical application. Which DR strategy should it implement?"

Answer: Warm Standby. Asynchronous replication keeps RPO within minutes, and the already-running, scaled-down infrastructure keeps RTO within minutes.

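Frequently Asked Questions

Q: What's the practical difference between Pilot Light and Warm Standby?

Pilot Light: Only database running, app servers stopped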
Warm Standby: Full stack running but scaled down

Warm Standby provides faster recovery but costs more.

Q: Aurora Global Database vs DynamoDB Global Tables?

Aurora Global: Relational data, SQL needed, ACID transactions
DynamoDB Global: NoSQL, typically sub-second replication, Active-Active writes
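
Active-Active writes mean two Regions can update the same item at nearly the same time. DynamoDB global tables reconcile such concurrent updates with a last-writer-wins rule. The sketch below shows the idea only; it is not DynamoDB's internal implementation, and the `Version` type, the timestamps, and the Region-name tie-break are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    written_at: float  # seconds since epoch (hypothetical timestamps)
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the most recent write; break exact ties by Region name (sketch assumption)."""
    return max(a, b, key=lambda v: (v.written_at, v.region))


seoul = Version("shipped", 1_700_000_000.250, "ap-northeast-2")
tokyo = Version("cancelled", 1_700_000_000.100, "ap-northeast-1")

winner = last_writer_wins(seoul, tokyo)
print(winner.value)  # shipped
```

The earlier write ("cancelled") is silently discarded, so applications that cannot tolerate lost updates should route writes for a given item to a single Region.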
