Amazon Redshift Basics: When to Use a Data Warehouse
Learn Amazon Redshift's core concepts, compare with RDS and Athena, and understand when a data warehouse is the right choice.
Related Exam Domains
- Domain 3: Design High-Performing Architectures
Key Takeaway
Amazon Redshift is a data warehouse that stores petabyte-scale data in columnar format and processes complex analytical queries (OLAP) rapidly. Choose RDS for OLTP (transactions), Athena for ad-hoc S3 queries, and Redshift for large-scale analytics/BI.
Exam Tip
Exam Essential: "Data warehouse, OLAP, BI, PB-scale analytics" → Redshift, "OLTP, transactions" → RDS, "S3 serverless queries" → Athena
When Should You Use Redshift?
Good Fit
- Business Intelligence (BI): Complex aggregation, join queries
- Data Warehousing: Unified analysis of data from multiple sources
- Petabyte-Scale Analytics: Analytical queries on massive datasets
- Regular Reporting: Daily/weekly/monthly dashboards
- Post-ETL Analysis: Load data from S3 → Redshift for analysis
Not a Good Fit
- OLTP Workloads: Real-time transactions → RDS/Aurora
- Ad-hoc S3 Queries: Simple S3 analysis → Athena
- NoSQL: Key-value/document data → DynamoDB
- Real-time Streaming Analytics: Millisecond latency → Kinesis Data Analytics
Redshift Core Features
Columnar Storage
Row-based (RDS):
ID | Name | Age | City
1 | Kim | 30 | Seoul
2 | Lee | 25 | Busan
3 | Park | 35 | Seoul
→ All columns of a row stored together
→ Good for specific row lookups (OLTP)
Column-based (Redshift):
ID: 1, 2, 3
Name: Kim, Lee, Park
Age: 30, 25, 35
City: Seoul, Busan, Seoul
→ Values of same column stored together
→ Good for aggregation queries (SUM, AVG, COUNT)
→ High compression ratio
Exam Tip
Columnar = Optimal for Analytics: For queries like SELECT AVG(age) FROM users, only the age column needs to be read, making it much faster than row-based.
MPP (Massively Parallel Processing)
┌──────────────────────────────────────┐
│ Leader Node │
│ (Query parsing, planning, results) │
└──────────┬───────────────────────────┘
│
┌─────┼─────┬─────────┐
▼ ▼ ▼ ▼
┌─────┐┌─────┐┌─────┐┌─────┐
│Comp ││Comp ││Comp ││Comp │
│Node1││Node2││Node3││Node4│
│ ││ ││ ││ │
│Slice││Slice││Slice││Slice│
└─────┘└─────┘└─────┘└─────┘
→ Queries split and executed in parallel
Node Types
| Type | Characteristics | Use Case |
|---|---|---|
| RA3 | Storage/compute separation, managed storage | Most workloads (recommended) |
| DC2 | SSD-based, fast local storage | Under 500GB |
| DS2 | HDD-based, large capacity | Legacy (migrate to RA3) |
Redshift Serverless
Automatically provisions capacity without cluster management.
| Aspect | Redshift Provisioned | Redshift Serverless |
|---|---|---|
| Infrastructure | Manually set node count/type | Automatic |
| Billing | Node-hour based | RPU (Redshift Processing Unit) usage |
| Scaling | Manual resize / Elastic Resize | Automatic |
| Best For | Predictable workloads | Intermittent/variable workloads |
Redshift Spectrum
Query S3 data directly from Redshift (no data loading required).
[Redshift Cluster]
│
├── Internal Tables (Redshift storage)
│
└── External Tables (S3 data) ← Spectrum
│
▼
[S3: Logs, historical data]
→ Query S3 data without loading into Redshift
Exam Tip
Redshift Spectrum vs Athena: Both query S3 data, but Spectrum requires a Redshift cluster and can join with internal tables. Athena is fully serverless.
Redshift vs RDS vs Athena
| Aspect | RDS/Aurora | Redshift | Athena |
|---|---|---|---|
| Type | OLTP (Transactions) | OLAP (Analytics) | Serverless queries |
| Storage | Row-based | Column-based | S3 (no internal storage) |
| Data Scale | GB ~ TB | TB ~ PB | Depends on S3 |
| Query Types | INSERT/UPDATE/SELECT | Complex aggregations/joins | Ad-hoc queries |
| Infrastructure | Managed instances | Cluster / Serverless | Fully serverless |
| Billing | Instance hours | Node hours / RPU | Data scanned ($5/TB) |
| Use Cases | App backend, CRUD | BI, dashboards, reports | S3 log analysis |
Redshift Feature Summary
| Feature | Description |
|---|---|
| Enhanced VPC Routing | Force all COPY/UNLOAD traffic through VPC |
| Snapshots | Auto/manual backups, cross-region copy |
| Federated Query | Directly query RDS/Aurora data |
| Data Sharing | Share data between clusters (no copying) |
| Concurrency Scaling | Auto-scale on query spikes |
| Encryption | Encrypt data at rest with KMS/HSM |
SAA-C03 Exam Focus Points
- ✅ OLAP vs OLTP: "Analytics queries = Redshift, Transactions = RDS"
- ✅ Columnar: "Columnar storage optimized for aggregation queries"
- ✅ Spectrum: "Query S3 data directly from Redshift"
- ✅ Athena Difference: "Athena = serverless ad-hoc, Redshift = ongoing analytics"
- ✅ Enhanced VPC Routing: "Force data transfer within VPC"
Exam Tip
Sample Exam Question: "Need to consolidate data from multiple sources and generate PB-scale BI reports. Which service is appropriate?" → Answer: Amazon Redshift (OLAP, PB-scale, BI analytics)
Frequently Asked Questions (FAQ)
Q: When should I choose Redshift vs Athena?
Athena is best for intermittent ad-hoc queries on S3 data. Redshift is best for consolidating data from multiple sources for ongoing complex analysis.
Q: Should I choose Redshift Serverless or Provisioned?
Provisioned is more cost-effective for predictable, continuous workloads. Serverless is better for intermittent, variable workloads.
Q: Can Redshift process real-time data?
Redshift Streaming Ingestion can ingest Kinesis Data Streams data in near real-time. However, for millisecond-level real-time analytics, Kinesis Data Analytics is more suitable.
Q: How do I load data into Redshift?
Use COPY command to load from S3, DynamoDB, or EMR. COPY from S3 is the most common and fastest method. Use INSERT only for small data volumes.
Q: How can I reduce Redshift costs?
Consider Reserved Instances (1-year/3-year commitment) for up to 75% savings, use Concurrency Scaling free credits, delete unnecessary snapshots, and use RA3 nodes to separate storage costs.
Related Posts
- RDS vs Aurora vs DynamoDB Selection Guide
- ElastiCache (Redis vs Memcached)
- S3 Storage Classes Complete Guide