Amazon Athena: Query S3 Data with SQL Without Managing Servers
Analyze S3 data using SQL with Amazon Athena. Learn how partitioning and Parquet can cut scan costs by up to 97%, and how Athena compares to S3 Select and Redshift.
Related Exam Domains
- Domain 3: Design High-Performing Architectures
- Domain 4: Design Cost-Optimized Architectures
Key Takeaway
Amazon Athena is a serverless query service that lets you analyze S3 data using SQL without managing infrastructure. You pay only for data scanned ($5/TB), and partitioning with Parquet format can reduce costs by up to 97%.
Exam Tip
Exam Essential: "Ad-hoc S3 data analysis" → Athena, "Single object filtering" → S3 Select, "Large-scale data warehouse" → Redshift
When Should You Use Athena?
Best For
Athena Recommended Scenarios:
├── Web/application log analysis
├── CloudTrail/VPC Flow logs querying
├── CSV/JSON/Parquet data stored in S3
├── Quick ad-hoc analysis needs
├── Data lake query layer
└── SQL execution without infrastructure
Not Ideal For
Cases Where Athena Isn't the Best Fit:
├── Real-time queries (millisecond response)
│ → ElastiCache, DynamoDB
├── Complex ETL transformations
│ → AWS Glue, EMR
├── Large-scale joins/sophisticated reporting
│ → Amazon Redshift
└── Simple filtering of single files
→ S3 Select
Core Architecture
How Athena Works
┌─────────────────────────────────────────────────────────────┐
│ Amazon Athena │
├─────────────────────────────────────────────────────────────┤
│ │
│ [SQL Query] │
│ │ │
│ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Athena │────▶│ AWS Glue │────▶│ S3 │ │
│ │ Engine │ │ Data Catalog│ │ (Data) │ │
│ │ (Trino) │ │ (Metadata) │ │ │ │
│ └──────────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ S3 │ ← Query results stored │
│ │ (Output) │ │
│ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
AWS Glue Data Catalog Integration
Glue Data Catalog Role:
├── Stores database/table metadata
├── Schema definitions (columns, data types)
├── Partition information management
└── Shared across Athena, Redshift Spectrum, EMR
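Because Athena DDL writes straight into the Glue Data Catalog, a database created in Athena is immediately visible to the other services. A minimal sketch (the database name is illustrative):

```sql
-- Creates a database entry in the Glue Data Catalog; Redshift Spectrum
-- and EMR see the same definition without any extra registration step.
CREATE DATABASE IF NOT EXISTS weblogs;

-- Lists tables registered in the catalog under that database.
SHOW TABLES IN weblogs;
```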
Supported Data Formats
| Format | Type | Features | Cost Efficiency |
|---|---|---|---|
| Parquet | Columnar | High compression, column selection | ⭐⭐⭐⭐⭐ |
| ORC | Columnar | Hive optimized, excellent compression | ⭐⭐⭐⭐⭐ |
| JSON | Row-based | Nested structure support | ⭐⭐⭐ |
| CSV | Row-based | Simple, universal | ⭐⭐ |
| Avro | Row-based | Schema evolution support | ⭐⭐⭐ |
Exam Tip
Exam Point: Athena cost optimization = Parquet/ORC format + Partitioning + Compression (Snappy, GZIP)
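All three optimizations can be applied in one step with a CTAS (CREATE TABLE AS SELECT) statement. A minimal sketch, assuming a CSV-backed source table `raw_csv_logs` is already registered in the Glue Catalog (table, column, and bucket names are illustrative):

```sql
-- Rewrite CSV data as Snappy-compressed, partitioned Parquet via CTAS.
-- Partition columns must come last in the SELECT list.
CREATE TABLE optimized_logs
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://logs-bucket/optimized-logs/',
  partitioned_by = ARRAY['year', 'month']
) AS
SELECT request_id, user_id, action, year, month
FROM raw_csv_logs;
```

Queries against `optimized_logs` then benefit from columnar reads, compression, and partition pruning at once.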
Pricing Structure
Base Charges
Athena SQL Query Cost:
├── $5 / TB (based on data scanned)
├── Minimum 10MB charge per query
└── S3 storage for results charged separately
Examples:
├── Full scan of 1TB text file → $5
├── Scan only 100GB partition → $0.50
└── 10GB Parquet column selection → $0.05
Cost Optimization Strategies
Cost Reduction Impact:
┌─────────────────────────────────────────────────────────────┐
│ Optimization Method │ Cost Reduction │
├─────────────────────────────────────────────────────────────┤
│ Convert to Parquet/ORC │ 3-5x savings │
│ Apply partitioning │ 10-100x savings │
│ Column compression (Snappy/GZIP)│ 2-3x savings │
│ Select specific columns vs * │ 2-10x savings │
│ Bucketing (high-cardinality) │ Up to 97% savings │
└─────────────────────────────────────────────────────────────┘
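The last two rows of the table combine naturally in everyday queries. A sketch, assuming a Parquet table `access_logs` partitioned by `year` and `month` (names are illustrative):

```sql
-- Worst case: scans every column of every partition.
SELECT * FROM access_logs;

-- Far cheaper: Parquet reads only the two named columns, and the
-- partition filter prunes all data outside January 2026.
SELECT user_id, action
FROM access_logs
WHERE year = 2026 AND month = 1;
```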
Partitioning: The Key to Cost Optimization
Hive-Style Partitioning
Without Partitioning:
s3://logs-bucket/access-logs/
├── log-001.parquet
├── log-002.parquet
└── ... (entire year of logs)
→ Full year scan: $50 (assuming 10TB)
With Partitioning:
s3://logs-bucket/access-logs/
├── year=2026/
│ ├── month=01/
│ │ ├── day=01/
│ │ │ └── log-001.parquet
│ │ └── day=02/
│ └── month=02/
└── year=2025/
→ Specific date only: $0.50 (100GB)
Partition Projection
-- Partition Projection Setup (No MSCK REPAIR needed)
CREATE EXTERNAL TABLE access_logs (
request_id STRING,
user_id STRING,
action STRING
)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 's3://logs-bucket/access-logs/'
TBLPROPERTIES (
'projection.enabled' = 'true',
'projection.year.type' = 'integer',
'projection.year.range' = '2020,2030',
'projection.month.type' = 'integer',
'projection.month.range' = '1,12',
'projection.day.type' = 'integer',
'projection.day.range' = '1,31',
'storage.location.template' =
's3://logs-bucket/access-logs/year=${year}/month=${month}/day=${day}'
);
Exam Tip
Partition Projection: Stores partition rules in table properties instead of registering each partition in the Glue Catalog. Eliminates the need for MSCK REPAIR TABLE on large datasets and improves query planning performance.
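Querying the projected table above looks identical to querying a normally partitioned one; the difference is that partitions are computed from the projection rules at query time:

```sql
-- Partitions are resolved from the projection ranges in TBLPROPERTIES,
-- so this works even for partitions added after the table was created,
-- with no MSCK REPAIR TABLE and no Glue partition entries.
SELECT request_id, action
FROM access_logs
WHERE year = 2026 AND month = 1 AND day = 15;
```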
Athena vs S3 Select vs Redshift
Comparison Table
| Feature | Athena | S3 Select | Redshift |
|---|---|---|---|
| Query Scope | Multiple files/tables | Single object | Full data warehouse |
| SQL Features | Full SQL (JOIN, aggregates, window) | Basic SELECT/WHERE | Full SQL + analytics |
| Schema | Glue Catalog required | Not required | Table definition needed |
| Infrastructure | Serverless | Serverless | Cluster (Serverless option) |
| Data Size | GB–TB | KB–GB | TB–PB |
| Latency | Seconds–minutes | Milliseconds | Milliseconds–seconds |
| Cost | $5/TB scanned | $0.002/GB scanned | Hourly billing |
Selection Decision Flow
Need to analyze S3 data?
│
▼
Simple filtering from single object?
│
Yes → [S3 Select]
│
No
│
▼
Ad-hoc SQL analysis across multiple files?
│
Yes → [Athena]
│
No
│
▼
Complex joins/large-scale ETL/BI reporting?
│
Yes → [Redshift / Redshift Spectrum]
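For the S3 Select branch, the SQL expression operates on a single object, referenced as `S3Object`, and is passed to the SelectObjectContent API rather than run in a console editor. A sketch, assuming a CSV object with a header row (FileHeaderInfo set to USE; column names are illustrative):

```sql
-- Filters rows inside one S3 object server-side, so only matching
-- records are returned -- the whole file is never downloaded.
SELECT s.request_id, s.action
FROM S3Object s
WHERE s.action = 'ERROR'
```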
Federated Queries
Query Multiple Data Sources
┌─────────────────────────────────────────────────────────────┐
│ Athena Federated Query │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ Athena │ │
│ └──────┬──────┘ │
│ │ │
│ ┌────────────┼────────────┬─────────────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐ │
│ │ S3 │ │DynamoDB │ │ RDS │ │ Redshift │ │
│ └─────┘ └─────────┘ └─────────┘ └──────────┘ │
│ │
│ → Join across sources using Lambda connectors │
└─────────────────────────────────────────────────────────────┘
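Once a connector is deployed and registered as a data source, the external system appears as an extra catalog in SQL. A sketch, assuming a DynamoDB connector registered under the catalog name `dynamo` (catalog, database, and table names depend on how the data source was set up and are illustrative):

```sql
-- Join S3 data (default catalog) with a DynamoDB table reached
-- through the Lambda-based connector.
SELECT l.request_id, u.plan_tier
FROM access_logs l
JOIN "dynamo"."default"."users" u
  ON l.user_id = u.user_id
WHERE l.year = 2026;
```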
Athena Spark
Apache Spark Integration
Athena Spark Features:
├── Notebook interface for Spark code
├── Python, Scala support
├── Complex data transformations
├── ML library access
└── DPU-hour pricing ($0.35/DPU-hour)
SAA-C03 Exam Focus Points
Commonly Tested Scenarios
- ✅ Service Selection: "Analyze S3 logs with SQL" → Athena
- ✅ Cost Optimization: "Reduce Athena costs" → Partitioning + Parquet
- ✅ Service Comparison: "S3 Select vs Athena" → Single object vs multiple objects
- ✅ Integration: "Glue Catalog purpose" → Metadata store
Sample Exam Questions
Exam Tip
Sample Exam Question 1: "A company stores several TBs of logs in S3 daily. They need to analyze errors from specific dates only. How can they minimize costs?"
→ Answer: Date-based partitioning + Athena query (scan only required partitions)
Exam Tip
Sample Exam Question 2: "You need to extract specific records from a 1GB JSON file stored in S3. What is the most cost-effective approach?"
→ Answer: S3 Select (simple filtering from single object)
Frequently Asked Questions
Q: What's the difference between Athena and Redshift Spectrum?
Athena is a standalone serverless query service. Redshift Spectrum queries S3 data from within a Redshift cluster. Use Spectrum if you already have Redshift; use Athena for pure S3 analysis.
Q: Where are Athena query results stored?
Query results are stored as CSV files in a specified S3 bucket. The output bucket is required, and S3 storage costs apply separately.
Q: When is MSCK REPAIR TABLE needed?
When new partitions are added, they must be registered in Glue Catalog. MSCK REPAIR TABLE auto-discovers partitions, or use partition projection to eliminate this step entirely.
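Both options in SQL, using the `access_logs` table from earlier (partition values are illustrative):

```sql
-- Option 1: auto-discover every Hive-style partition under the table's
-- LOCATION and register it in the Glue Catalog.
MSCK REPAIR TABLE access_logs;

-- Option 2: register a single new partition explicitly -- faster than a
-- full repair when only one partition was added.
ALTER TABLE access_logs ADD IF NOT EXISTS
  PARTITION (year = 2026, month = 2, day = 1);
```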
Q: Is Athena suitable for real-time queries?
No. Athena has second-to-minute response times, best for ad-hoc analysis. For millisecond responses, consider ElastiCache, DynamoDB, or provisioned Redshift.
Q: Can Athena perform UPDATE/DELETE operations?
Not by default. Athena is a read-only query service. However, using Apache Iceberg tables enables ACID transactions (INSERT, UPDATE, DELETE).
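A sketch of the Iceberg route (table name, columns, and location are illustrative):

```sql
-- table_type = 'ICEBERG' makes this an Iceberg table, which supports
-- ACID DML, unlike a plain external table.
CREATE TABLE events (
  event_id string,
  status string,
  event_date date
)
LOCATION 's3://logs-bucket/iceberg/events/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

-- UPDATE and DELETE now succeed instead of failing with an error.
UPDATE events SET status = 'resolved' WHERE event_id = 'abc-123';
DELETE FROM events WHERE event_date < DATE '2025-01-01';
```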