
Amazon Kinesis: Complete Guide to Real-Time Streaming Data Processing

Kinesis Data Streams vs Firehose differences, SQS comparison, and shard capacity planning. SAA-C03 exam essentials explained.

PHILOLAMB-
Kinesis, Streaming, Real-time Data, Data Streams, Firehose

Related Exam Domains

  • Domain 3: Design High-Performing Architectures

Key Takeaway

Amazon Kinesis is a platform for collecting, processing, and analyzing real-time streaming data. Data Streams provides millisecond-latency real-time processing, while Firehose automatically delivers data to S3/Redshift in near real-time.

Exam Tip

Exam Essential: "Real-time streaming + multiple consumers" → Kinesis Data Streams, "Auto-deliver to S3/Redshift" → Firehose, "Message queuing + single consumer" → SQS


When Should You Use Kinesis?

Best For

Kinesis Recommended Scenarios:
├── Real-time log/event streaming
│   └── Web clickstreams, application logs
├── IoT sensor data ingestion
│   └── Thousands to millions of devices
├── Real-time analytics and dashboards
│   └── Live leaderboards, fraud detection
├── Multiple consumers processing same data
│   └── Analytics, storage, alerts in parallel
└── Data replay/reprocessing needed
    └── Up to 365 days retention

Not Ideal For

Cases Where Kinesis Isn't the Best Fit:
├── Single message processing (1:1 communication)
│   → Use SQS
├── Message fan-out (1:N push)
│   → Use SNS
├── Event-based routing
│   → Use EventBridge
└── Low throughput, sporadic messages
    → SQS/SNS combination is simpler and cheaper

Kinesis Service Types

The Four Services Compared

┌─────────────────────────────────────────────────────────────┐
│                    Amazon Kinesis Family                     │
├──────────────────┬──────────────────────────────────────────┤
│                  │                                          │
│  Data Streams    │  Real-time data streaming (ms latency)   │
│  ───────────     │  • Shard-based capacity management       │
│                  │  • 24hr-365day data retention            │
│                  │  • Multiple consumers simultaneously     │
│                  │                                          │
├──────────────────┼──────────────────────────────────────────┤
│                  │                                          │
│  Firehose        │  Data delivery service (near real-time)  │
│  ────────        │  • Auto-deliver to S3, Redshift, etc.    │
│                  │  • Auto-scaling, zero management         │
│                  │  • Lambda transformation supported       │
│                  │                                          │
├──────────────────┼──────────────────────────────────────────┤
│                  │                                          │
│  Data Analytics  │  SQL queries on streaming data           │
│  ──────────────  │  • Real-time aggregation, filtering      │
│                  │  • Apache Flink based                    │
│                  │  • Now Managed Service for Apache Flink  │
│                  │                                          │
├──────────────────┼──────────────────────────────────────────┤
│                  │                                          │
│  Video Streams   │  Video streaming ingestion & processing  │
│  ─────────────   │  • Camera, CCTV video analysis           │
│                  │  • ML model integration                  │
│                  │                                          │
└──────────────────┴──────────────────────────────────────────┘

Kinesis Data Streams Core Concepts

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Kinesis Data Streams                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   [Producers]                                                │
│   (EC2, Mobile, IoT)                                        │
│       │                                                      │
│       ▼                                                      │
│   ┌──────────────────────────────────────────┐              │
│   │              Data Stream                  │              │
│   │  ┌────────┐ ┌────────┐ ┌────────┐       │              │
│   │  │ Shard 1│ │ Shard 2│ │ Shard 3│       │              │
│   │  │ 1MB/s  │ │ 1MB/s  │ │ 1MB/s  │       │              │
│   │  │ write  │ │ write  │ │ write  │       │              │
│   │  └────────┘ └────────┘ └────────┘       │              │
│   └──────────────────────────────────────────┘              │
│       │                                                      │
│       ▼                                                      │
│   [Consumers]                                                │
│   ┌────────┐ ┌────────┐ ┌────────┐                         │
│   │ Lambda │ │  KCL   │ │Firehose│                         │
│   │        │ │  App   │ │  → S3  │                         │
│   └────────┘ └────────┘ └────────┘                         │
│                                                              │
└─────────────────────────────────────────────────────────────┘
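
To make the producer side of the diagram concrete, here is a minimal sketch of how a producer might batch events for the PutRecords API, which accepts at most 500 records per request. The event shape, the `device_id` field, and the stream name `sensor-stream` are hypothetical; the boto3 call is left as a comment so the snippet runs without AWS credentials.

```python
import json
from typing import Iterable, Iterator

MAX_RECORDS_PER_PUT = 500  # PutRecords limit per request


def to_record(event: dict) -> dict:
    """Shape one event as a PutRecords entry (Data + PartitionKey)."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        # Hypothetical choice: partition by device so each device's events stay ordered.
        "PartitionKey": str(event["device_id"]),
    }


def batches(events: Iterable[dict], size: int = MAX_RECORDS_PER_PUT) -> Iterator[list]:
    """Yield lists of at most `size` PutRecords entries."""
    batch = []
    for event in events:
        batch.append(to_record(event))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


events = [{"device_id": i % 10, "temp": 20 + i % 5} for i in range(1200)]
batch_sizes = [len(b) for b in batches(events)]
print(batch_sizes)  # [500, 500, 200]

# With credentials configured, each batch could be sent like this:
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_records(StreamName="sensor-stream", Records=batch)
```

PutRecords can partially fail, so a real producer would also check `FailedRecordCount` in the response and retry the failed entries.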

Shard Capacity

Operation           Capacity Limit
------------------  --------------------------------------
Write (Ingestion)   1MB/sec or 1,000 records/sec per shard
Read (Consumption)  2MB/sec per shard
Read Transactions   5 GetRecords calls/sec per shard

Exam Tip

Shard Calculation Example: If you need 5MB/sec of writes, you need at least 5 shards (1MB/sec per shard), provided the record rate also stays within 1,000 records/sec per shard. Exam tip: "Throughput issues" → Increase shard count!
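
The shard limits above can be turned into a small calculator. This is a sketch, not an official sizing tool: the function name `min_shards` and the sample workloads are made up for illustration, and it assumes standard (shared) consumers reading at 2MB/sec per shard.

```python
import math


def min_shards(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    by_write_bytes = math.ceil(write_mb_s / 1.0)          # 1 MB/s write per shard
    by_write_records = math.ceil(write_records_s / 1000)  # 1,000 records/s per shard
    by_read_bytes = math.ceil(read_mb_s / 2.0)            # 2 MB/s read per shard
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


print(min_shards(5, 2000, 5))  # 5 (write volume is the bottleneck)
print(min_shards(1, 8000, 1))  # 8 (many small records: record rate is the bottleneck)
```

The second call shows why the records/sec limit matters: a stream of tiny records can need more shards than its MB/sec figure suggests.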

Partition Key

Partition Key Role:
├── Determines which shard receives data
├── MD5 hash function maps to shard
├── Max 256 Unicode characters
└── Same key → Same shard → Order guaranteed

Good Partition Key Examples:
├── user_id (per-user ordering)
├── device_id (per-device ordering)
└── session_id (per-session ordering)

Bad Partition Key:
└── Fixed value (e.g., "default") → Hot shard problem!
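
To see why a fixed key creates a hot shard, the key-to-shard mapping can be imitated locally. The sketch below assumes the 128-bit hash key space is split evenly across shards, which is typical for a freshly created stream but not guaranteed after resharding; `shard_for_key` is an illustrative helper, not an AWS API.

```python
import hashlib
from collections import Counter


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index via its 128-bit MD5 hash.

    Assumes the hash key space [0, 2**128) is divided evenly across shards.
    """
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // shard_count
    return min(hash_key // range_size, shard_count - 1)


# Varied keys spread across the shards.
spread = Counter(shard_for_key(f"user-{i}", 4) for i in range(1000))
print(sorted(spread.items()))

# A fixed key always lands on the same shard: the hot shard problem.
hot = Counter(shard_for_key("default", 4) for _ in range(1000))
print(sorted(hot.items()))
```

Every record with key "default" hits one shard, so that shard's 1MB/sec write limit caps the whole stream no matter how many shards exist.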

Data Retention Period

Retention             Cost                     Use Case
--------------------  -----------------------  -----------------------------
24 hours (default)    No extra charge          Standard real-time processing
7 days (extended)     Additional GB-month fee  Reprocessing needed
365 days (long-term)  Lower GB-month rate      Compliance, audit
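
The retention API takes hours rather than days, so a small conversion helper avoids off-by-24 mistakes. This is a sketch: `retention_hours` and the stream name `sensor-stream` are hypothetical, and the boto3 call is shown only as a comment.

```python
def retention_hours(days: int) -> int:
    """Convert a retention period in days to the hours value the API expects."""
    if not 1 <= days <= 365:
        raise ValueError("retention must be between 1 and 365 days")
    return days * 24


print(retention_hours(7))    # 168
print(retention_hours(365))  # 8760

# With credentials configured, retention could be raised like this:
#   import boto3
#   boto3.client("kinesis").increase_stream_retention_period(
#       StreamName="sensor-stream",
#       RetentionPeriodHours=retention_hours(7),
#   )
```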

Capacity Modes

Mode         Features                             When to Choose
-----------  -----------------------------------  --------------------------------------
On-Demand    Auto-scaling, pay-per-use            Unpredictable traffic
Provisioned  Specify shard count, hourly billing  Predictable traffic, cost optimization

Kinesis Data Firehose

Key Features

Firehose Core Features:
├── Fully Managed (Serverless)
│   └── No shard management, auto-scaling
├── Near Real-Time Delivery
│   └── Buffers by size or time interval (up to 900 seconds)
├── Data Transformation Support
│   ├── Lambda function transformation
│   └── Parquet/ORC format conversion
├── Automatic Compression
│   └── GZIP, Snappy, ZIP
└── No Data Storage
    └── Direct delivery to destination
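
Firehose's near real-time behavior comes from buffering: records accumulate until a size or time threshold is reached, then the batch is delivered. The toy model below illustrates that rule only; `BufferSim` is a made-up class, the thresholds are arbitrary, and it is not how Firehose is implemented internally.

```python
from dataclasses import dataclass, field


@dataclass
class BufferSim:
    """Toy model of size/time buffering (illustrative only)."""
    max_bytes: int
    max_seconds: float
    _records: list = field(default_factory=list)
    _bytes: int = 0
    _started: float = 0.0
    deliveries: list = field(default_factory=list)

    def put(self, now: float, record: bytes) -> None:
        self.tick(now)  # deliver anything that has waited too long
        if not self._records:
            self._started = now  # timer starts with the first buffered record
        self._records.append(record)
        self._bytes += len(record)
        if self._bytes >= self.max_bytes:
            self._flush("size")

    def tick(self, now: float) -> None:
        if self._records and now - self._started >= self.max_seconds:
            self._flush("time")

    def _flush(self, reason: str) -> None:
        self.deliveries.append((reason, len(self._records)))
        self._records, self._bytes = [], 0


buf = BufferSim(max_bytes=10, max_seconds=60)
buf.put(0, b"aaaa")     # 4 bytes buffered
buf.put(1, b"bbbbbbb")  # 11 bytes total: size threshold reached
buf.put(2, b"cc")       # new buffer, timer starts at t=2
buf.tick(70)            # 68 seconds elapsed: time threshold reached
print(buf.deliveries)   # [('size', 2), ('time', 1)]
```

Whichever threshold is hit first triggers delivery, which is why low-volume streams see latency close to the time interval.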

Supported Destinations

Firehose Destinations:
├── AWS Services
│   ├── Amazon S3
│   ├── Amazon Redshift (via S3)
│   ├── Amazon OpenSearch Service
│   └── Apache Iceberg Tables
├── Third-Party
│   ├── Splunk
│   ├── Snowflake
│   ├── Datadog
│   └── MongoDB
└── Custom
    └── HTTP Endpoints

Data Streams vs Firehose: Which One Should You Choose?

Comparison Table

Aspect               Data Streams                                   Firehose
-------------------  ---------------------------------------------  ---------------------------------------------
Latency              Milliseconds (real-time)                       Buffered, seconds to minutes (near real-time)
Data Storage         24hr-365 day retention                         No storage (direct delivery)
Scaling              Manual shards (Provisioned), auto (On-Demand)  Automatic
Consumers            Multiple consumers supported                   Specified destinations only
Data Transformation  Consumer handles it                            Built-in Lambda transformation
Cost                 Shard + data volume                            Data volume only
Complexity           Higher (shard design required)                 Lower

Decision Flow

Need real-time streaming data processing?
        │
        ▼
Need millisecond-level real-time processing?
        │
       Yes → Do multiple consumers need to process the same data?
        │           │
        │          Yes → [Kinesis Data Streams]
        │           │
        │          No → Need data replay capability?
        │                   │
        │                  Yes → [Kinesis Data Streams]
        │                   │
        │                  No → [Kinesis Data Streams + single consumer; Firehose only if buffered delivery is acceptable]
        │
       No
        │
        ▼
Is auto-delivery to S3/Redshift/OpenSearch the goal?
        │
       Yes → [Kinesis Data Firehose]
        │
       No → [Consider other services (SQS, EventBridge)]

Exam Tip

Exam Keyword Mapping:

  • "Real-time analytics", "multiple applications consuming simultaneously" → Data Streams
  • "Auto-save to S3", "minimal operational overhead" → Firehose
  • "Data replay" → Data Streams (possible within retention period)

Kinesis vs SQS vs SNS: Which One Should You Choose?

Comparison Table

Aspect      Kinesis Data Streams     SQS                          SNS
----------  -----------------------  ---------------------------  --------------------
Model       Pull-based streaming     Pull-based queue             Push-based Pub/Sub
Consumers   Multiple (same data)     Single per message           Multiple subscribers
Retention   24hr-365 days            Up to 14 days                No retention
Ordering    Per partition key        FIFO queues only             FIFO topics only
Throughput  Millions of records/sec  Nearly unlimited (standard)  High
Replay      Possible                 Not possible                 Not possible
Cost        Higher                   Lower                        Lower

Selection Criteria

Choosing a Messaging/Streaming Service:
        │
        ▼
Is this high-volume real-time streaming data? (logs, IoT, clickstream)
        │
       Yes → Need data replay or multiple consumers?
        │           │
        │          Yes → [Kinesis Data Streams]
        │           │
        │          No → [Consider Firehose + SQS combination]
        │
       No
        │
        ▼
Need to send messages to multiple subscribers simultaneously? (fan-out)
        │
       Yes → [SNS] or [SNS + SQS combination]
        │
       No
        │
        ▼
Is async task decoupling/buffering the goal?
        │
       Yes → [SQS]
        │
       No → [EventBridge] (event routing)

Exam Tip

SQS vs Kinesis Key Difference:

  • SQS: Message is deleted after processing, other consumers cannot access it
  • Kinesis: Multiple consumers can read the same data, replay possible within retention period
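
The difference is easiest to see side by side. The two toy classes below imitate the semantics only (they are not AWS clients): `ToyQueue` removes a message once it is received and processed, while `ToyStream` keeps records in a log and lets each consumer track, and rewind, its own position.

```python
from collections import deque


class ToyQueue:
    """SQS-like: a message is gone once one consumer has processed it."""

    def __init__(self, messages):
        self._messages = deque(messages)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class ToyStream:
    """Kinesis-like: records stay in the log; each consumer has its own position."""

    def __init__(self, records):
        self._log = list(records)
        self._positions = {}

    def read(self, consumer: str):
        pos = self._positions.get(consumer, 0)
        if pos >= len(self._log):
            return None
        self._positions[consumer] = pos + 1
        return self._log[pos]

    def rewind(self, consumer: str, position: int = 0) -> None:
        self._positions[consumer] = position


queue = ToyQueue(["order-1", "order-2"])
worker_a = queue.receive()  # "order-1"
worker_b = queue.receive()  # "order-2": worker B never sees order-1

stream = ToyStream(["click-1", "click-2"])
analytics = stream.read("analytics")  # "click-1"
archiver = stream.read("archiver")    # "click-1": same record, second consumer
stream.rewind("analytics")
replayed = stream.read("analytics")   # "click-1" again: replay
print(worker_a, worker_b, analytics, archiver, replayed)
```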

Pricing Structure

Data Streams Pricing (US-East)

On-Demand Standard:

Item            Price
--------------  -----------------
Data ingestion  $0.08/GB
Data retrieval  $0.04/GB
Stream hour     $0.04/hour/stream

Provisioned:

Item               Price
-----------------  -------------------
Shard hour         $0.015/shard/hour
PUT payload units  $0.014/million units

Firehose Pricing

Item         Price (US-East)
-----------  ---------------
First 500TB  $0.029/GB
Next 1.5PB   $0.025/GB
Next 3PB     $0.020/GB

Exam Tip

Cost Optimization:

  • Predictable workload → Provisioned mode is cheaper
  • Sporadic/unpredictable → On-Demand mode
  • Simple delivery only → Firehose cheaper than Data Streams
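
As a rough check on the first rule, the Data Streams prices listed above can be plugged into a quick estimate. Everything about the workload here is an assumption for illustration: a steady 1,000 records/sec of 1KB each, read once, a 730-hour month, decimal GB, 2 provisioned shards for headroom, and 25KB PUT payload units. Real bills depend on current AWS pricing and on extras such as extended retention.

```python
import math

HOURS_PER_MONTH = 730   # assumption: average month length
KB_PER_GB = 1_000_000   # assumption: decimal units


def provisioned_cost(shards: int, records_per_sec: float, record_kb: float) -> float:
    """Monthly cost: shard hours plus PUT payload units."""
    shard_hours = shards * HOURS_PER_MONTH * 0.015
    units_per_record = math.ceil(record_kb / 25)  # one unit per 25 KB chunk
    units = records_per_sec * units_per_record * 3600 * HOURS_PER_MONTH
    return shard_hours + units / 1_000_000 * 0.014


def on_demand_cost(ingest_gb: float, retrieval_gb: float) -> float:
    """Monthly cost: stream hours plus per-GB ingestion and retrieval."""
    return HOURS_PER_MONTH * 0.04 + ingest_gb * 0.08 + retrieval_gb * 0.04


monthly_gb = 1000 * 1 * 3600 * HOURS_PER_MONTH / KB_PER_GB  # 2628.0 GB
provisioned = provisioned_cost(shards=2, records_per_sec=1000, record_kb=1)
on_demand = on_demand_cost(ingest_gb=monthly_gb, retrieval_gb=monthly_gb)
print(f"provisioned: ${provisioned:.2f}")  # provisioned: $58.69
print(f"on-demand:   ${on_demand:.2f}")    # on-demand:   $344.56
```

For this steady load, Provisioned comes out several times cheaper; On-Demand earns its premium when traffic is spiky enough that provisioned shards would sit idle or be overrun.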

SAA-C03 Exam Focus Points

Commonly Tested Scenarios

  1. Service Selection: "Real-time clickstream analytics" → Kinesis Data Streams
  2. Streams vs Firehose: "Auto-save to S3, minimize management" → Firehose
  3. Kinesis vs SQS: "Multiple applications processing same data" → Kinesis
  4. Capacity Calculation: "5MB/sec of writes needed" → At least 5 shards required
  5. Data Retention: "Need 7-day replay capability" → Extended retention setting

Sample Exam Questions

Exam Tip

Sample Exam Question 1: "IoT sensors generate millions of events per second that need to be sent to both a real-time dashboard and S3 archive simultaneously. What is the most appropriate architecture?"

→ Answer: Kinesis Data Streams + 2 consumers (Lambda for dashboard + Firehose for S3)

Exam Tip

Sample Exam Question 2: "Web server logs need to be collected and stored in Amazon S3 in Parquet format. How can this be implemented with minimal operational overhead?"

→ Answer: Kinesis Data Firehose (S3 destination + Parquet conversion enabled)

Exam Tip

Sample Exam Question 3: "An order processing system requires each order to be processed exactly once. How should the system be protected during order spikes?"

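→ Answer: SQS FIFO queue (buffers the spike, exactly-once processing) - NOT Kinesis! A standard SQS queue delivers at least once, so consumers would have to be idempotent.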


Frequently Asked Questions

Q: Can Kinesis Data Streams and Firehose be used together?

Yes. You can connect Firehose as a consumer to Data Streams. This is useful when you need both real-time processing (Data Streams consumers) and S3 archiving (Firehose) simultaneously.
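
As a sketch of that wiring, the helper below builds the request parameters for a Firehose delivery stream that reads from an existing data stream and writes GZIP-compressed objects to S3. The parameter names follow the `create_delivery_stream` API; the function name, the ARNs, and the buffering values are placeholders chosen for illustration.

```python
def firehose_from_stream_params(name: str, stream_arn: str,
                                bucket_arn: str, role_arn: str) -> dict:
    """Build keyword arguments for firehose.create_delivery_stream."""
    return {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": stream_arn,
            "RoleARN": role_arn,
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            # Illustrative values: deliver every 5 MB or 300 seconds.
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
            "CompressionFormat": "GZIP",
        },
    }


params = firehose_from_stream_params(
    name="clicks-to-s3",  # hypothetical names and ARNs
    stream_arn="arn:aws:kinesis:us-east-1:123456789012:stream/clicks",
    bucket_arn="arn:aws:s3:::example-clicks-archive",
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
)
print(params["DeliveryStreamType"])

# With credentials configured:
#   import boto3
#   boto3.client("firehose").create_delivery_stream(**params)
```

Other consumers (Lambda, KCL applications) keep reading the same data stream in parallel; Firehose is simply one more consumer.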

Q: How do you determine the number of shards?

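Work out each limit, round up to a whole shard, and choose the largest value:

For writes (volume): Required MB/s ÷ 1MB = Minimum shards
For writes (records): Required records/s ÷ 1,000 = Minimum shards
For reads: Required MB/s ÷ 2MB = Minimum shards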

Q: Is message ordering guaranteed in Kinesis?

Ordering is guaranteed per partition key. Records with the same partition key go to the same shard, and order is preserved within a shard. Stream-level ordering across all shards is not guaranteed.

Q: What is Enhanced Fan-Out?

By default, all consumers share 2MB/s per shard. With Enhanced Fan-Out, each consumer gets 2MB/s per shard independently. Useful when you have many consumers.
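
A quick calculation shows how fast the shared limit shrinks as consumers are added. The helper name is made up, and the sketch assumes the consumers split the shared 2MB/s evenly.

```python
SHARD_READ_MB_S = 2.0  # read throughput per shard


def per_consumer_read_mb_s(consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput each consumer can expect from one shard."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced_fan_out:
        return SHARD_READ_MB_S          # dedicated 2 MB/s per consumer
    return SHARD_READ_MB_S / consumers  # shared 2 MB/s, assumed split evenly


for n in (1, 2, 4):
    shared = per_consumer_read_mb_s(n, enhanced_fan_out=False)
    dedicated = per_consumer_read_mb_s(n, enhanced_fan_out=True)
    print(n, shared, dedicated)
```

With 4 standard consumers each one gets about 0.5MB/s per shard, which is less than the 1MB/s a shard can ingest, so they fall behind at full write load; Enhanced Fan-Out removes that bottleneck.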

Q: Is Kinesis serverless?

Firehose is fully serverless. Data Streams requires shard management, but On-Demand mode provides automatic scaling.

