
Amazon Athena: Query S3 Data with SQL Without Managing Servers

Analyze S3 data using SQL with Amazon Athena. Learn how partitioning and Parquet can cut query costs by up to 97%, and how Athena compares with S3 Select and Redshift.

PHILOLAMB
Tags: Athena, Serverless SQL, S3 Analytics, Data Lake

Related Exam Domains

  • Domain 3: Design High-Performing Architectures
  • Domain 4: Design Cost-Optimized Architectures

Key Takeaway

Amazon Athena is a serverless query service that lets you analyze S3 data using SQL without managing infrastructure. You pay only for data scanned ($5/TB), and partitioning with Parquet format can reduce costs by up to 97%.

Exam Tip

Exam Essential: "Ad-hoc S3 data analysis" → Athena, "Single object filtering" → S3 Select, "Large-scale data warehouse" → Redshift


When Should You Use Athena?

Best For

Athena Recommended Scenarios:
├── Web/application log analysis
├── CloudTrail/VPC Flow logs querying
├── CSV/JSON/Parquet data stored in S3
├── Quick ad-hoc analysis needs
├── Data lake query layer
└── SQL execution without infrastructure
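As a sketch of the kind of ad-hoc query Athena handles well (the table and column names here are illustrative, not from an actual schema):

```sql
-- Hypothetical ad-hoc analysis over web access logs stored in S3
SELECT status_code,
       COUNT(*) AS request_count
FROM access_logs
WHERE request_date = DATE '2026-01-15'
GROUP BY status_code
ORDER BY request_count DESC;
```

No cluster to provision: you point Athena at the S3 location via a table definition and run standard SQL.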

Not Ideal For

Cases Where Athena Isn't the Best Fit:
├── Real-time queries (millisecond response)
│   → ElastiCache, DynamoDB
├── Complex ETL transformations
│   → AWS Glue, EMR
├── Large-scale joins/sophisticated reporting
│   → Amazon Redshift
└── Simple filtering of single files
    → S3 Select

Core Architecture

How Athena Works

┌─────────────────────────────────────────────────────────────┐
│                       Amazon Athena                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   [SQL Query]                                                │
│       │                                                      │
│       ▼                                                      │
│   ┌──────────┐     ┌──────────────┐     ┌──────────────┐   │
│   │  Athena  │────▶│  AWS Glue    │────▶│     S3       │   │
│   │  Engine  │     │  Data Catalog│     │   (Data)     │   │
│   │ (Trino)  │     │  (Metadata)  │     │              │   │
│   └──────────┘     └──────────────┘     └──────────────┘   │
│       │                                                      │
│       ▼                                                      │
│   ┌──────────┐                                              │
│   │   S3     │  ← Query results stored                      │
│   │ (Output) │                                              │
│   └──────────┘                                              │
│                                                              │
└─────────────────────────────────────────────────────────────┘

AWS Glue Data Catalog Integration

Glue Data Catalog Role:
├── Stores database/table metadata
├── Schema definitions (columns, data types)
├── Partition information management
└── Shared across Athena, Redshift Spectrum, EMR

Supported Data Formats

| Format  | Type      | Features                              | Cost Efficiency |
|---------|-----------|---------------------------------------|-----------------|
| Parquet | Columnar  | High compression, column selection    | ⭐⭐⭐⭐⭐      |
| ORC     | Columnar  | Hive-optimized, excellent compression | ⭐⭐⭐⭐⭐      |
| JSON    | Row-based | Nested structure support              | ⭐⭐⭐          |
| CSV     | Row-based | Simple, universal                     | ⭐⭐            |
| Avro    | Row-based | Schema evolution support              | ⭐⭐⭐          |

Exam Tip

Exam Point: Athena cost optimization = Parquet/ORC format + Partitioning + Compression (Snappy, GZIP)


Pricing Structure

Base Charges

Athena SQL Query Cost:
├── $5 / TB (based on data scanned)
├── Minimum 10MB charge per query
└── S3 storage for results charged separately

Examples:
├── Full scan of 1TB text file → $5
├── Scan only 100GB partition → $0.50
└── 10GB Parquet column selection → $0.05
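A common way to realize the Parquet savings above is a CTAS (CREATE TABLE AS SELECT) query that rewrites raw CSV data as compressed, partitioned Parquet. A minimal sketch, assuming a source table `access_logs_csv` and hypothetical bucket paths:

```sql
-- Hypothetical CTAS: convert raw CSV logs to Snappy-compressed,
-- partitioned Parquet (partition columns must come last in the SELECT)
CREATE TABLE access_logs_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://logs-bucket/access-logs-parquet/',
  partitioned_by = ARRAY['year', 'month']
) AS
SELECT request_id, user_id, action, year, month
FROM access_logs_csv;
```

After the rewrite, queries against the new table scan only the columns and partitions they reference, which is where the 3-5x (format) and 10-100x (partitioning) savings combine.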

Cost Optimization Strategies

Cost Reduction Impact:
┌─────────────────────────────────────────────────────────────┐
│ Optimization Method             │ Cost Reduction           │
├─────────────────────────────────────────────────────────────┤
│ Convert to Parquet/ORC          │ 3-5x savings             │
│ Apply partitioning              │ 10-100x savings          │
│ Column compression (Snappy/GZIP)│ 2-3x savings             │
│ Select specific columns vs *    │ 2-10x savings            │
│ Bucketing (high-cardinality)    │ Up to 97% savings        │
└─────────────────────────────────────────────────────────────┘

Partitioning: The Key to Cost Optimization

Hive-Style Partitioning

Without Partitioning:
s3://logs-bucket/access-logs/
├── log-001.parquet
├── log-002.parquet
└── ... (entire year of logs)
→ Full year scan: $50 (assuming 10TB)

With Partitioning:
s3://logs-bucket/access-logs/
├── year=2026/
│   ├── month=01/
│   │   ├── day=01/
│   │   │   └── log-001.parquet
│   │   └── day=02/
│   └── month=02/
└── year=2025/
→ Specific date only: $0.50 (100GB)
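With the partitioned layout above, a query that filters on the partition keys prunes everything outside the matching S3 prefix. A sketch against the illustrative `access_logs` table:

```sql
-- Only objects under year=2026/month=01/day=01/ are scanned
SELECT user_id, action
FROM access_logs
WHERE year = 2026 AND month = 1 AND day = 1;
```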

Partition Projection

-- Partition Projection Setup (No MSCK REPAIR needed)
CREATE EXTERNAL TABLE access_logs (
  request_id STRING,
  user_id STRING,
  action STRING
)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 's3://logs-bucket/access-logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'storage.location.template' =
    's3://logs-bucket/access-logs/year=${year}/month=${month}/day=${day}'
);

Exam Tip

Partition Projection: Stores partition rules in table properties instead of Glue Catalog. Eliminates MSCK REPAIR TABLE for large datasets and improves query planning performance.


Athena vs S3 Select vs Redshift

Comparison Table

| Feature        | Athena                              | S3 Select          | Redshift                    |
|----------------|-------------------------------------|--------------------|-----------------------------|
| Query Scope    | Multiple files/tables               | Single object      | Full data warehouse         |
| SQL Features   | Full SQL (JOIN, aggregates, window) | Basic SELECT/WHERE | Full SQL + analytics        |
| Schema         | Glue Catalog required               | Not required       | Table definition needed     |
| Infrastructure | Serverless                          | Serverless         | Cluster (Serverless option) |
| Data Size      | GB-TB                               | KB-GB              | TB-PB                       |
| Latency        | Seconds-minutes                     | Milliseconds       | Milliseconds-seconds        |
| Cost           | $5/TB scanned                       | $0.002/GB scanned  | Hourly billing              |

Selection Decision Flow

Need to analyze S3 data?
        │
        ▼
Simple filtering from single object?
        │
       Yes → [S3 Select]
        │
        No
        │
        ▼
Ad-hoc SQL analysis across multiple files?
        │
       Yes → [Athena]
        │
        No
        │
        ▼
Complex joins/large-scale ETL/BI reporting?
        │
       Yes → [Redshift / Redshift Spectrum]

Federated Queries

Query Multiple Data Sources

┌─────────────────────────────────────────────────────────────┐
│                     Athena Federated Query                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│            ┌─────────────┐                                  │
│            │   Athena    │                                  │
│            └──────┬──────┘                                  │
│                   │                                          │
│      ┌────────────┼────────────┬─────────────┐              │
│      ▼            ▼            ▼             ▼              │
│   ┌─────┐    ┌─────────┐  ┌─────────┐  ┌──────────┐        │
│   │ S3  │    │DynamoDB │  │  RDS    │  │ Redshift │        │
│   └─────┘    └─────────┘  └─────────┘  └──────────┘        │
│                                                              │
│   → Join across sources using Lambda connectors             │
└─────────────────────────────────────────────────────────────┘
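A federated query references a connector's catalog in the table path. A sketch with hypothetical catalog, schema, and table names (`ddb_catalog` standing in for a Lambda-backed DynamoDB connector registered in Athena):

```sql
-- Hypothetical federated join: S3 data lake table joined with a
-- DynamoDB table exposed through a Lambda connector catalog
SELECT o.order_id, o.total, c.email
FROM s3_datalake.orders o
JOIN ddb_catalog.default.customers c
  ON o.customer_id = c.customer_id;
```

The connector Lambda translates Athena's requests into the source's native API, so the join runs in one SQL statement without copying data into S3 first.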

Athena Spark

Apache Spark Integration

Athena Spark Features:
├── Notebook interface for Spark code
├── Python, Scala support
├── Complex data transformations
├── ML library access
└── DPU-hour pricing ($0.35/DPU-hour)

SAA-C03 Exam Focus Points

Commonly Tested Scenarios

  1. Service Selection: "Analyze S3 logs with SQL" → Athena
  2. Cost Optimization: "Reduce Athena costs" → Partitioning + Parquet
  3. Service Comparison: "S3 Select vs Athena" → Single object vs multiple objects
  4. Integration: "Glue Catalog purpose" → Metadata store

Sample Exam Questions

Exam Tip

Sample Exam Question 1: "A company stores several TBs of logs in S3 daily. They need to analyze errors from specific dates only. How can they minimize costs?"

→ Answer: Date-based partitioning + Athena query (scan only required partitions)

Exam Tip

Sample Exam Question 2: "You need to extract specific records from a 1GB JSON file stored in S3. What is the most cost-effective approach?"

→ Answer: S3 Select (simple filtering from single object)


Frequently Asked Questions

Q: What's the difference between Athena and Redshift Spectrum?

Athena is a standalone serverless query service. Redshift Spectrum queries S3 data from within a Redshift cluster. Use Spectrum if you already have Redshift; use Athena for pure S3 analysis.

Q: Where are Athena query results stored?

Query results are stored as CSV files in a specified S3 bucket. The output bucket is required, and S3 storage costs apply separately.

Q: When is MSCK REPAIR TABLE needed?

When new partitions are added, they must be registered in Glue Catalog. MSCK REPAIR TABLE auto-discovers partitions, or use partition projection to eliminate this step entirely.
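The two registration options look like this (table and partition values illustrative):

```sql
-- Option 1: auto-discover all partitions (slow on large datasets)
MSCK REPAIR TABLE access_logs;

-- Option 2: register a single new partition explicitly
ALTER TABLE access_logs ADD IF NOT EXISTS
  PARTITION (year = 2026, month = 1, day = 2)
  LOCATION 's3://logs-bucket/access-logs/year=2026/month=01/day=02/';
```

With partition projection enabled (as in the earlier DDL example), neither statement is needed.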

Q: Is Athena suitable for real-time queries?

No. Athena has second-to-minute response times, best for ad-hoc analysis. For millisecond responses, consider ElastiCache, DynamoDB, or provisioned Redshift.

Q: Can Athena perform UPDATE/DELETE operations?

Not by default. Athena is a read-only query service. However, using Apache Iceberg tables enables ACID transactions (INSERT, UPDATE, DELETE).
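A minimal sketch of an Iceberg table in Athena, with a hypothetical table name and S3 location:

```sql
-- Iceberg table: Athena supports row-level DML on this table type
CREATE TABLE events_iceberg (
  event_id STRING,
  status   STRING
)
LOCATION 's3://logs-bucket/events-iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

UPDATE events_iceberg SET status = 'archived' WHERE status = 'stale';
DELETE FROM events_iceberg WHERE status = 'archived';
```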
