
Amazon Athena: Query S3 Data with SQL Without Managing Servers

Analyze S3 data using SQL with Amazon Athena. Learn how partitioning and Parquet can cut query costs by up to 97%, and how Athena compares with S3 Select and Redshift.

PHILOLAMB
Tags: Athena, Serverless SQL, S3 Analytics, Data Lake

Related Exam Domains

  • Domain 3: Design High-Performing Architectures
  • Domain 4: Design Cost-Optimized Architectures

Key Takeaway

Amazon Athena is a serverless query service that lets you analyze S3 data using SQL without managing infrastructure. You pay only for data scanned ($5/TB), and partitioning with Parquet format can reduce costs by up to 97%.

Exam Tip

Exam Essential: "Ad-hoc S3 data analysis" → Athena, "Single object filtering" → S3 Select, "Large-scale data warehouse" → Redshift


When Should You Use Athena?

Best For

Athena Recommended Scenarios:
├── Web/application log analysis
├── CloudTrail/VPC Flow logs querying
├── CSV/JSON/Parquet data stored in S3
├── Quick ad-hoc analysis needs
├── Data lake query layer
└── SQL execution without infrastructure
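As a sketch of the kind of ad-hoc query Athena handles well (the table and column names here are illustrative, not from an actual schema):

```sql
-- Hypothetical ad-hoc analysis over web access logs stored in S3
SELECT status_code,
       COUNT(*) AS request_count
FROM access_logs
WHERE request_date = DATE '2026-01-15'
GROUP BY status_code
ORDER BY request_count DESC;
```

No cluster to provision: you point Athena at the S3 location via a table definition and run standard SQL.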

Not Ideal For

Cases Where Athena Isn't the Best Fit:
├── Real-time queries (millisecond response)
│   → ElastiCache, DynamoDB
├── Complex ETL transformations
│   → AWS Glue, EMR
├── Large-scale joins/sophisticated reporting
│   → Amazon Redshift
└── Simple filtering of single files
    → S3 Select

Core Architecture

How Athena Works

┌─────────────────────────────────────────────────────────────┐
│                       Amazon Athena                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   [SQL Query]                                                │
│       │                                                      │
│       ▼                                                      │
│   ┌──────────┐     ┌──────────────┐     ┌──────────────┐   │
│   │  Athena  │────▶│  AWS Glue    │────▶│     S3       │   │
│   │  Engine  │     │  Data Catalog│     │   (Data)     │   │
│   │ (Trino)  │     │  (Metadata)  │     │              │   │
│   └──────────┘     └──────────────┘     └──────────────┘   │
│       │                                                      │
│       ▼                                                      │
│   ┌──────────┐                                              │
│   │   S3     │  ← Query results stored                      │
│   │ (Output) │                                              │
│   └──────────┘                                              │
│                                                              │
└─────────────────────────────────────────────────────────────┘

AWS Glue Data Catalog Integration

Glue Data Catalog Role:
├── Stores database/table metadata
├── Schema definitions (columns, data types)
├── Partition information management
└── Shared across Athena, Redshift Spectrum, EMR

Supported Data Formats

| Format  | Type      | Features                              | Cost Efficiency |
|---------|-----------|---------------------------------------|-----------------|
| Parquet | Columnar  | High compression, column selection    | ⭐⭐⭐⭐⭐      |
| ORC     | Columnar  | Hive-optimized, excellent compression | ⭐⭐⭐⭐⭐      |
| JSON    | Row-based | Nested structure support              | ⭐⭐⭐          |
| CSV     | Row-based | Simple, universal                     | ⭐⭐            |
| Avro    | Row-based | Schema evolution support              | ⭐⭐⭐          |

Exam Tip

Exam Point: Athena cost optimization = Parquet/ORC format + Partitioning + Compression (Snappy, GZIP)


Pricing Structure

Base Charges

Athena SQL Query Cost:
├── $5 / TB (based on data scanned)
├── Minimum 10MB charge per query
└── S3 storage for results charged separately

Examples:
├── Full scan of 1TB text file → $5
├── Scan only 100GB partition → $0.50
└── 10GB Parquet column selection → $0.05
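A common way to realize the Parquet savings above is a CTAS (CREATE TABLE AS SELECT) query that rewrites raw CSV data as compressed, partitioned Parquet. A minimal sketch, assuming a source table `access_logs_csv` and hypothetical bucket paths:

```sql
-- Hypothetical CTAS: convert raw CSV logs to Snappy-compressed,
-- partitioned Parquet (partition columns must come last in the SELECT)
CREATE TABLE access_logs_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://logs-bucket/access-logs-parquet/',
  partitioned_by = ARRAY['year', 'month']
) AS
SELECT request_id, user_id, action, year, month
FROM access_logs_csv;
```

After the rewrite, queries against the new table scan only the columns and partitions they reference, which is where the 3-5x (format) and 10-100x (partitioning) savings combine.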

Cost Optimization Strategies

Cost Reduction Impact:
┌─────────────────────────────────────────────────────────────┐
│ Optimization Method             │ Cost Reduction           │
├─────────────────────────────────────────────────────────────┤
│ Convert to Parquet/ORC          │ 3-5x savings             │
│ Apply partitioning              │ 10-100x savings          │
│ Column compression (Snappy/GZIP)│ 2-3x savings             │
│ Select specific columns vs *    │ 2-10x savings            │
│ Bucketing (high-cardinality)    │ Up to 97% savings        │
└─────────────────────────────────────────────────────────────┘

Partitioning: The Key to Cost Optimization

Hive-Style Partitioning

Without Partitioning:
s3://logs-bucket/access-logs/
├── log-001.parquet
├── log-002.parquet
└── ... (entire year of logs)
→ Full year scan: $50 (assuming 10TB)

With Partitioning:
s3://logs-bucket/access-logs/
├── year=2026/
│   ├── month=01/
│   │   ├── day=01/
│   │   │   └── log-001.parquet
│   │   └── day=02/
│   └── month=02/
└── year=2025/
→ Specific date only: $0.50 (100GB)
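With the partitioned layout above, a query that filters on the partition keys prunes everything outside the matching S3 prefix. A sketch against the illustrative `access_logs` table:

```sql
-- Only objects under year=2026/month=01/day=01/ are scanned
SELECT user_id, action
FROM access_logs
WHERE year = 2026 AND month = 1 AND day = 1;
```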

Partition Projection

-- Partition Projection Setup (No MSCK REPAIR needed)
CREATE EXTERNAL TABLE access_logs (
  request_id STRING,
  user_id STRING,
  action STRING
)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 's3://logs-bucket/access-logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'storage.location.template' =
    's3://logs-bucket/access-logs/year=${year}/month=${month}/day=${day}'
);

Exam Tip

Partition Projection: Stores partition rules in table properties instead of Glue Catalog. Eliminates MSCK REPAIR TABLE for large datasets and improves query planning performance.


Athena vs S3 Select vs Redshift

Comparison Table

| Feature        | Athena                              | S3 Select          | Redshift                    |
|----------------|-------------------------------------|--------------------|-----------------------------|
| Query Scope    | Multiple files/tables               | Single object      | Full data warehouse         |
| SQL Features   | Full SQL (JOIN, aggregates, window) | Basic SELECT/WHERE | Full SQL + analytics        |
| Schema         | Glue Catalog required               | Not required       | Table definition needed     |
| Infrastructure | Serverless                          | Serverless         | Cluster (Serverless option) |
| Data Size      | GB-TB                               | KB-GB              | TB-PB                       |
| Latency        | Seconds-minutes                     | Milliseconds       | Milliseconds-seconds        |
| Cost           | $5/TB scanned                       | $0.002/GB scanned  | Hourly billing              |

Selection Decision Flow

Need to analyze S3 data?
        │
        ▼
Simple filtering from single object?
        │
       Yes → [S3 Select]
        │
        No
        │
        ▼
Ad-hoc SQL analysis across multiple files?
        │
       Yes → [Athena]
        │
        No
        │
        ▼
Complex joins/large-scale ETL/BI reporting?
        │
       Yes → [Redshift / Redshift Spectrum]

Federated Queries

Query Multiple Data Sources

┌─────────────────────────────────────────────────────────────┐
│                     Athena Federated Query                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│            ┌─────────────┐                                  │
│            │   Athena    │                                  │
│            └──────┬──────┘                                  │
│                   │                                          │
│      ┌────────────┼────────────┬─────────────┐              │
│      ▼            ▼            ▼             ▼              │
│   ┌─────┐    ┌─────────┐  ┌─────────┐  ┌──────────┐        │
│   │ S3  │    │DynamoDB │  │  RDS    │  │ Redshift │        │
│   └─────┘    └─────────┘  └─────────┘  └──────────┘        │
│                                                              │
│   → Join across sources using Lambda connectors             │
└─────────────────────────────────────────────────────────────┘
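A federated query references a connector's catalog in the table path. A sketch with hypothetical catalog, schema, and table names (`ddb_catalog` standing in for a Lambda-backed DynamoDB connector registered in Athena):

```sql
-- Hypothetical federated join: S3 data lake table joined with a
-- DynamoDB table exposed through a Lambda connector catalog
SELECT o.order_id, o.total, c.email
FROM s3_datalake.orders o
JOIN ddb_catalog.default.customers c
  ON o.customer_id = c.customer_id;
```

The connector Lambda translates Athena's requests into the source's native API, so the join runs in one SQL statement without copying data into S3 first.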

Athena Spark

Apache Spark Integration

Athena Spark Features:
├── Notebook interface for Spark code
├── Python, Scala support
├── Complex data transformations
├── ML library access
└── DPU-hour pricing ($0.35/DPU-hour)

SAA-C03 Exam Focus Points

Commonly Tested Scenarios

  1. Service Selection: "Analyze S3 logs with SQL" → Athena
  2. Cost Optimization: "Reduce Athena costs" → Partitioning + Parquet
  3. Service Comparison: "S3 Select vs Athena" → Single object vs multiple objects
  4. Integration: "Glue Catalog purpose" → Metadata store

Sample Exam Questions

Exam Tip

Sample Exam Question 1: "A company stores several TBs of logs in S3 daily. They need to analyze errors from specific dates only. How can they minimize costs?"

→ Answer: Date-based partitioning + Athena query (scan only required partitions)

Exam Tip

Sample Exam Question 2: "You need to extract specific records from a 1GB JSON file stored in S3. What is the most cost-effective approach?"

→ Answer: S3 Select (simple filtering from single object)


Frequently Asked Questions

Q: What's the difference between Athena and Redshift Spectrum?

Athena is a standalone serverless query service. Redshift Spectrum queries S3 data from within a Redshift cluster. Use Spectrum if you already have Redshift; use Athena for pure S3 analysis.

Q: Where are Athena query results stored?

Query results are stored as CSV files in a specified S3 bucket. The output bucket is required, and S3 storage costs apply separately.

Q: When is MSCK REPAIR TABLE needed?

When new partitions are added, they must be registered in Glue Catalog. MSCK REPAIR TABLE auto-discovers partitions, or use partition projection to eliminate this step entirely.
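The two registration options look like this (table and partition values illustrative):

```sql
-- Option 1: auto-discover all partitions (slow on large datasets)
MSCK REPAIR TABLE access_logs;

-- Option 2: register a single new partition explicitly
ALTER TABLE access_logs ADD IF NOT EXISTS
  PARTITION (year = 2026, month = 1, day = 2)
  LOCATION 's3://logs-bucket/access-logs/year=2026/month=01/day=02/';
```

With partition projection enabled (as in the earlier DDL example), neither statement is needed.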

Q: Is Athena suitable for real-time queries?

No. Athena has second-to-minute response times, best for ad-hoc analysis. For millisecond responses, consider ElastiCache, DynamoDB, or provisioned Redshift.

Q: Can Athena perform UPDATE/DELETE operations?

Not by default. Athena is a read-only query service. However, using Apache Iceberg tables enables ACID transactions (INSERT, UPDATE, DELETE).
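A minimal sketch of an Iceberg table in Athena, with a hypothetical table name and S3 location:

```sql
-- Iceberg table: Athena supports row-level DML on this table type
CREATE TABLE events_iceberg (
  event_id STRING,
  status   STRING
)
LOCATION 's3://logs-bucket/events-iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

UPDATE events_iceberg SET status = 'archived' WHERE status = 'stale';
DELETE FROM events_iceberg WHERE status = 'archived';
```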
