Amazon Kinesis: Complete Guide to Real-Time Streaming Data Processing
Kinesis Data Streams vs Firehose differences, SQS comparison, and shard capacity planning. SAA-C03 exam essentials explained.
Related Exam Domains
- Domain 3: Design High-Performing Architectures
Key Takeaway
Amazon Kinesis is a platform for collecting, processing, and analyzing real-time streaming data. Data Streams provides millisecond-latency real-time processing, while Firehose automatically delivers data to S3/Redshift in near real-time.
Exam Tip
Exam Essential: "Real-time streaming + multiple consumers" → Kinesis Data Streams, "Auto-deliver to S3/Redshift" → Firehose, "Message queuing + single consumer" → SQS
When Should You Use Kinesis?
Best For
Kinesis Recommended Scenarios:
├── Real-time log/event streaming
│ └── Web clickstreams, application logs
├── IoT sensor data ingestion
│ └── Thousands to millions of devices
├── Real-time analytics and dashboards
│ └── Live leaderboards, fraud detection
├── Multiple consumers processing same data
│ └── Analytics, storage, alerts in parallel
└── Data replay/reprocessing needed
└── Up to 365 days retention
Not Ideal For
Cases Where Kinesis Isn't the Best Fit:
├── Single message processing (1:1 communication)
│ → Use SQS
├── Message fan-out (1:N push)
│ → Use SNS
├── Event-based routing
│ → Use EventBridge
└── Low throughput, sporadic messages
→ SQS/SNS combination is simpler and cheaper
Kinesis Service Types
The Four Services Compared
┌─────────────────────────────────────────────────────────────┐
│ Amazon Kinesis Family │
├──────────────────┬──────────────────────────────────────────┤
│ │ │
│ Data Streams │ Real-time data streaming (ms latency) │
│ ─────────── │ • Shard-based capacity management │
│ │ • 24hr-365day data retention │
│ │ • Multiple consumers simultaneously │
│ │ │
├──────────────────┼──────────────────────────────────────────┤
│ │ │
│ Firehose │ Data delivery service (near real-time) │
│ ──────── │ • Auto-deliver to S3, Redshift, etc. │
│ │ • Auto-scaling, zero management │
│ │ • Lambda transformation supported │
│ │ │
├──────────────────┼──────────────────────────────────────────┤
│ │ │
│ Data Analytics │ Stream processing and analytics │
│ ────────────── │ • Real-time aggregation, filtering │
│ │ • Apache Flink based (now named Amazon Managed Service for Apache Flink) │
│ │ │
├──────────────────┼──────────────────────────────────────────┤
│ │ │
│ Video Streams │ Video streaming ingestion & processing │
│ ───────────── │ • Camera, CCTV video analysis │
│ │ • ML model integration │
│ │ │
└──────────────────┴──────────────────────────────────────────┘
Kinesis Data Streams Core Concepts
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Kinesis Data Streams │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Producers] │
│ (EC2, Mobile, IoT) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Data Stream │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ Shard 1│ │ Shard 2│ │ Shard 3│ │ │
│ │ │ 1MB/s │ │ 1MB/s │ │ 1MB/s │ │ │
│ │ │ write │ │ write │ │ write │ │ │
│ │ └────────┘ └────────┘ └────────┘ │ │
│ └──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ [Consumers] │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Lambda │ │ KCL │ │Firehose│ │
│ │ │ │ App │ │ → S3 │ │
│ └────────┘ └────────┘ └────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
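As a rough illustration of the Lambda consumer in the diagram, the sketch below decodes a batch of Kinesis records. It relies on the event shape Lambda uses for Kinesis sources (`Records[].kinesis.data`, base64-encoded); the assumption that producers send JSON and the returned dictionary layout are illustrative choices, not something this guide prescribes.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and return the parsed payloads."""
    payloads = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Lambda delivers the record body base64-encoded under kinesis.data.
        raw = base64.b64decode(kinesis["data"])
        payloads.append(
            {
                "partition_key": kinesis["partitionKey"],
                "sequence_number": kinesis["sequenceNumber"],
                "body": json.loads(raw),  # assumption: producers send JSON
            }
        )
    return payloads
```

In a real function you would forward the decoded payloads to a dashboard or datastore instead of returning them.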
Shard Capacity
| Operation | Capacity Limit |
|---|---|
| Write (Ingestion) | 1 MB/sec and 1,000 records/sec per shard (whichever limit is reached first) |
| Read (Consumption) | 2 MB/sec per shard (shared by all consumers unless Enhanced Fan-Out is used) |
| Read Transactions | 5 GetRecords calls/sec per shard |
Exam Tip
Shard Calculation Example: If you need 5 MB/sec of write throughput, you need at least 5 shards (1 MB/sec each). Exam tip: "Throughput issues" → Increase shard count!
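The shard arithmetic can be sketched in a few lines of Python. This is a minimal estimate based only on the per-shard limits in the table above; the function name and parameters are made up for illustration, and real sizing should leave headroom for traffic spikes and uneven partition keys.

```python
import math

# Per-shard limits taken from the Shard Capacity table.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def min_shards(write_mb_per_sec, records_per_sec=0, read_mb_per_sec=0.0):
    """Return the smallest shard count that satisfies every per-shard limit."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)
```

For example, `min_shards(5)` covers the 5 MB/sec case from the tip, while many small records can make the records-per-second limit the deciding factor.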
Partition Key
Partition Key Role:
├── Determines which shard receives data
├── MD5 hash of the key maps to a shard's hash key range
├── Max 256 Unicode characters
└── Same key → Same shard → Order guaranteed
Good Partition Key Examples:
├── user_id (per-user ordering)
├── device_id (per-device ordering)
└── session_id (per-session ordering)
Bad Partition Key:
└── Fixed value (e.g., "default") → Hot shard problem!
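To make the key-to-shard mapping concrete, here is a simplified sketch. It assumes the 128-bit MD5 hash space is split evenly across shards, which is only true for a stream that has not been resharded unevenly; `shard_for_key` is a hypothetical helper, not an AWS API.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Scale the hash into [0, shard_count) without floating-point error.
    return hash_key * shard_count // HASH_SPACE
```

Because the hash is deterministic, every record with the key `"default"` lands on one shard (the hot shard problem), while keys such as `user_id` spread records across shards.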
Data Retention Period
| Retention | Cost | Use Case |
|---|---|---|
| 24 hours (default) | No extra charge | Standard real-time processing |
| Up to 7 days (extended) | Additional retention charge | Reprocessing needed |
| Up to 365 days (long-term) | Per GB-month storage fee | Compliance, audit |
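The Kinesis API expresses retention in hours (the `RetentionPeriodHours` parameter), so the tiers above correspond to 24, up to 168, and up to 8,760 hours. The helper below is a hypothetical sketch for classifying a requested retention value; the tier labels are this guide's names, not API values.

```python
MIN_RETENTION_HOURS = 24        # default retention
EXTENDED_MAX_HOURS = 7 * 24     # 168 hours
LONG_TERM_MAX_HOURS = 365 * 24  # 8,760 hours


def retention_tier(hours):
    """Classify a retention period (in hours) into the tiers from the table."""
    if not MIN_RETENTION_HOURS <= hours <= LONG_TERM_MAX_HOURS:
        raise ValueError("retention must be between 24 and 8760 hours")
    if hours == MIN_RETENTION_HOURS:
        return "default"
    if hours <= EXTENDED_MAX_HOURS:
        return "extended"
    return "long-term"
```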
Capacity Modes
| Mode | Features | When to Choose |
|---|---|---|
| On-Demand | Auto-scaling, pay-per-use | Unpredictable traffic |
| Provisioned | Specify shard count, hourly billing | Predictable traffic, cost optimization |
Kinesis Data Firehose
Key Features
Firehose Core Features:
├── Fully Managed (Serverless)
│ └── No shard management, auto-scaling
├── Near Real-Time Delivery
│ └── Buffers by size or time before delivery (interval configurable from 0 to 900 seconds; 60 seconds was the long-standing minimum)
├── Data Transformation Support
│ └── Lambda function transformation
│ └── Parquet/ORC format conversion
├── Automatic Compression
│ └── GZIP, Snappy, ZIP
└── No Data Storage
└── Direct delivery to destination
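The Lambda transformation feature can be sketched as follows. Firehose passes a batch under `records`, each with a `recordId` and base64-encoded `data`, and expects every record back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The JSON payload, the `level` field, and the rule for dropping DEBUG entries are assumptions made up for this example.

```python
import base64
import json


def handler(event, context):
    """Transform a Firehose batch: drop DEBUG logs and tag the rest."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            payload = None
        if not isinstance(payload, dict):
            # Return the original data so Firehose can route it to the error output.
            output.append({"recordId": record_id, "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if payload.get("level") == "DEBUG":  # hypothetical filter rule
            output.append({"recordId": record_id, "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["processed"] = True
        # A trailing newline keeps records separated in the delivered S3 object.
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({"recordId": record_id, "result": "Ok",
                       "data": encoded.decode("ascii")})
    return {"records": output}
```

Every incoming `recordId` must appear in the response, which is why dropped and failed records are still returned rather than skipped.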
Supported Destinations
Firehose Destinations:
├── AWS Services
│ ├── Amazon S3
│ ├── Amazon Redshift (via S3)
│ ├── Amazon OpenSearch Service
│ └── Apache Iceberg Tables
├── Third-Party
│ ├── Splunk
│ ├── Snowflake
│ ├── Datadog
│ └── MongoDB
└── Custom
└── HTTP Endpoints
Data Streams vs Firehose: Which One Should You Choose?
Comparison Table
| Aspect | Data Streams | Firehose |
|---|---|---|
| Latency | Milliseconds (real-time) | Seconds to minutes, depending on buffering (near real-time) |
| Data Storage | 24hr-365 day retention | No storage (direct delivery) |
| Scaling | Manual shard management in provisioned mode; automatic in on-demand mode | Automatic |
| Consumers | Multiple consumers supported | Specified destinations only |
| Data Transformation | Consumer handles it | Built-in Lambda transformation |
| Cost | Shard + data volume | Data volume only |
| Complexity | Higher (shard design required) | Lower |
Decision Flow
Need real-time streaming data processing?
│
▼
Need millisecond-level real-time processing?
│
Yes → Do multiple consumers need to process the same data?
│ │
│ Yes → [Kinesis Data Streams]
│ │
│ No → Need data replay capability?
│ │
│ Yes → [Kinesis Data Streams]
│ │
│ No → [Firehose is simpler]
│
No
│
▼
Is auto-delivery to S3/Redshift/OpenSearch the goal?
│
Yes → [Kinesis Data Firehose]
│
No → [Consider other services (SQS, EventBridge)]
Exam Tip
Exam Keyword Mapping:
- "Real-time analytics", "multiple applications consuming simultaneously" → Data Streams
- "Auto-save to S3", "minimal operational overhead" → Firehose
- "Data replay" → Data Streams (possible within retention period)
Kinesis vs SQS vs SNS: Which One Should You Choose?
Comparison Table
| Aspect | Kinesis Data Streams | SQS | SNS |
|---|---|---|---|
| Model | Pull-based streaming | Pull-based queue | Push-based Pub/Sub |
| Consumers | Multiple (same data) | Single per message | Multiple subscribers |
| Retention | 24hr-365 days | Up to 14 days | No retention |
| Ordering | Per partition key (within a shard) | FIFO queues only | FIFO topics only |
| Throughput | Millions of records/sec (scales with shard count) | Nearly unlimited for standard queues; limited per FIFO queue | High |
| Replay | Possible | Not possible | Not possible |
| Cost | Higher | Lower | Lower |
Selection Criteria
Choosing a Messaging/Streaming Service:
│
▼
Is this high-volume real-time streaming data? (logs, IoT, clickstream)
│
Yes → Need data replay or multiple consumers?
│ │
│ Yes → [Kinesis Data Streams]
│ │
│ No → [Consider Firehose + SQS combination]
│
No
│
▼
Need to send messages to multiple subscribers simultaneously? (fan-out)
│
Yes → [SNS] or [SNS + SQS combination]
│
No
│
▼
Is async task decoupling/buffering the goal?
│
Yes → [SQS]
│
No → [EventBridge] (event routing)
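The selection flow above can also be written as a small function, which makes the order of the questions explicit. This is only a restatement of the diagram; the function and parameter names are invented for illustration.

```python
def choose_service(high_volume_stream, needs_replay_or_multi_consumer=False,
                   needs_fan_out=False, needs_decoupling=False):
    """Walk the selection flow top to bottom and return the suggested service."""
    if high_volume_stream:
        if needs_replay_or_multi_consumer:
            return "Kinesis Data Streams"
        return "Firehose + SQS"
    if needs_fan_out:
        return "SNS (or SNS + SQS)"
    if needs_decoupling:
        return "SQS"
    return "EventBridge"
```

Note that the streaming question is asked first, so a high-volume clickstream with several consumers resolves to Kinesis Data Streams before fan-out is even considered.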
Exam Tip
SQS vs Kinesis Key Difference:
- SQS: Message is deleted after processing, other consumers cannot access it
- Kinesis: Multiple consumers can read the same data, replay possible within retention period
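The difference is easier to see in a toy model. The two classes below are in-memory stand-ins, not AWS clients: `ToyQueue` removes a message once it is received, while `ToyStream` keeps every record and tracks a separate read position per consumer.

```python
from collections import deque


class ToyQueue:
    """Queue semantics: a received message is gone for everyone else."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class ToyStream:
    """Stream semantics: records stay put; each consumer has its own position."""

    def __init__(self):
        self._records = []
        self._positions = {}

    def put(self, record):
        self._records.append(record)

    def read(self, consumer):
        start = self._positions.get(consumer, 0)
        self._positions[consumer] = len(self._records)
        return self._records[start:]

    def replay(self, consumer, position=0):
        # Rewinding the position lets the consumer reprocess retained records.
        self._positions[consumer] = position
```

Two consumers calling `read` on the same `ToyStream` both see every record, whereas two workers calling `receive` on a `ToyQueue` split the messages between them.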
Pricing Structure
Data Streams Pricing (US-East)
On-Demand Standard:
| Item | Price |
|---|---|
| Data ingestion | $0.08/GB |
| Data retrieval | $0.04/GB |
| Stream hour | $0.04/hour/stream |
Provisioned:
| Item | Price |
|---|---|
| Shard hour | $0.015/shard/hour |
| PUT payload units | $0.014/million units |
Firehose Pricing
| Item | Price (US-East) |
|---|---|
| First 500TB | $0.029/GB |
| Next 1.5PB | $0.025/GB |
| Next 3PB | $0.020/GB |
Exam Tip
Cost Optimization:
- Predictable workload → Provisioned mode is cheaper
- Sporadic/unpredictable → On-Demand mode
- Simple delivery only → Firehose cheaper than Data Streams
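A back-of-the-envelope comparison of the two Data Streams modes can be sketched with the prices from the tables above. The 730-hour month is an assumption, a PUT payload unit is counted as one 25 KB chunk per record, and charges such as extended retention and Enhanced Fan-Out are ignored. Prices change, so treat the constants as placeholders.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average month length in hours

# Prices copied from the pricing tables in this guide (US-East).
SHARD_HOUR = 0.015
PUT_UNITS_PER_MILLION = 0.014
ON_DEMAND_STREAM_HOUR = 0.04
ON_DEMAND_INGEST_GB = 0.08
ON_DEMAND_RETRIEVAL_GB = 0.04


def provisioned_monthly_cost(shards, records_per_sec, record_kb):
    """Estimate provisioned-mode cost: shard hours plus PUT payload units."""
    shard_cost = shards * HOURS_PER_MONTH * SHARD_HOUR
    units_per_record = math.ceil(record_kb / 25)  # one unit per 25 KB chunk
    records_per_month = records_per_sec * 3600 * HOURS_PER_MONTH
    put_cost = records_per_month * units_per_record / 1_000_000 * PUT_UNITS_PER_MILLION
    return shard_cost + put_cost


def on_demand_monthly_cost(ingest_gb, retrieval_gb):
    """Estimate on-demand cost for one stream: stream hours plus data volume."""
    return (HOURS_PER_MONTH * ON_DEMAND_STREAM_HOUR
            + ingest_gb * ON_DEMAND_INGEST_GB
            + retrieval_gb * ON_DEMAND_RETRIEVAL_GB)
```

Running both functions against the same expected traffic gives a quick sense of where the break-even point lies for a steady workload.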
SAA-C03 Exam Focus Points
Commonly Tested Scenarios
- ✅ Service Selection: "Real-time clickstream analytics" → Kinesis Data Streams
- ✅ Streams vs Firehose: "Auto-save to S3, minimize management" → Firehose
- ✅ Kinesis vs SQS: "Multiple applications processing same data" → Kinesis
- ✅ Capacity Calculation: "5MB/sec processing needed" → 5 shards required
- ✅ Data Retention: "Need 7-day replay capability" → Extended retention setting
Sample Exam Questions
Exam Tip
Sample Exam Question 1: "IoT sensors generate millions of events per second that need to be sent to both a real-time dashboard and S3 archive simultaneously. What is the most appropriate architecture?"
→ Answer: Kinesis Data Streams + 2 consumers (Lambda for dashboard + Firehose for S3)
Exam Tip
Sample Exam Question 2: "Web server logs need to be collected and stored in Amazon S3 in Parquet format. How can this be implemented with minimal operational overhead?"
→ Answer: Kinesis Data Firehose (S3 destination + Parquet conversion enabled)
Exam Tip
Sample Exam Question 3: "An order processing system requires each order to be processed exactly once. How should the system be protected during order spikes?"
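→ Answer: SQS FIFO queue (exactly-once processing, buffers the spike) - NOT Kinesis! Standard SQS queues deliver at least once, so consumers would need to be idempotent.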
Frequently Asked Questions
Q: Can Kinesis Data Streams and Firehose be used together?
Yes. You can connect Firehose as a consumer to Data Streams. This is useful when you need both real-time processing (Data Streams consumers) and S3 archiving (Firehose) simultaneously.
Q: How do you determine the number of shards?
For writes: required write MB/s ÷ 1 MB/s = minimum shards (also check records/s ÷ 1,000)
For reads: required read MB/s ÷ 2 MB/s = minimum shards
Round each result up and choose the larger value.
Q: Is message ordering guaranteed in Kinesis?
Ordering is guaranteed per partition key. Records with the same partition key go to the same shard, and order is preserved within a shard. Stream-level ordering across all shards is not guaranteed.
Q: What is Enhanced Fan-Out?
By default, all consumers share 2MB/s per shard. With Enhanced Fan-Out, each consumer gets 2MB/s per shard independently. Useful when you have many consumers.
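The effect on each consumer's read budget can be shown with a one-line calculation. This sketch assumes consumers in shared mode split the 2 MB/s evenly, which is an idealization; `per_consumer_read_mb` is a hypothetical helper.

```python
SHARD_READ_MB_PER_SEC = 2.0


def per_consumer_read_mb(consumers, enhanced_fan_out=False):
    """Return the read throughput per consumer, per shard, in MB/s."""
    if consumers < 1:
        raise ValueError("at least one consumer is required")
    if enhanced_fan_out:
        # Each registered consumer gets its own dedicated 2 MB/s.
        return SHARD_READ_MB_PER_SEC
    # Shared mode: all consumers draw from the same 2 MB/s.
    return SHARD_READ_MB_PER_SEC / consumers
```

With four consumers, shared mode leaves each one about 0.5 MB/s per shard, while Enhanced Fan-Out keeps each at the full 2 MB/s.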
Q: Is Kinesis serverless?
Firehose is fully serverless. Data Streams in provisioned mode requires shard management, while on-demand mode scales automatically.