CloudWatch Complete Guide: Metrics, Alarms, and Log Monitoring
Learn how to monitor AWS infrastructure using Amazon CloudWatch metrics, alarms, logs, and dashboards. Master CloudWatch for the SAA-C03 exam.
Related Exam Domains
- Domain 2: Design Resilient Architectures
Key Takeaway
Amazon CloudWatch is a monitoring service that collects metrics from AWS resources, sets alarms, and centrally manages logs. EC2 basic monitoring reports metrics at 5-minute intervals, detailed monitoring reports at 1-minute intervals, and memory and disk-space metrics require the CloudWatch agent.
Exam Tip
Exam Essential: "Performance monitoring = CloudWatch", "API auditing = CloudTrail", "Configuration tracking = AWS Config". EC2 memory and disk-space utilization are NOT included in default metrics!
CloudWatch Core Components
[Amazon CloudWatch]
│
├── Metrics
│ ├── AWS service default metrics
│ └── Custom metrics
│
├── Alarms
│ ├── Metric alarms
│ └── Composite alarms
│
├── Logs
│ ├── Log groups / Log streams
│ └── Logs Insights (query)
│
├── Dashboards
│
└── Events / EventBridge
Metrics
EC2 Default Metrics
| Metric | Description | Included by Default |
|---|---|---|
| CPUUtilization | CPU usage percentage | ✅ |
| NetworkIn/Out | Network traffic | ✅ |
| DiskReadOps/WriteOps | Disk I/O operations (instance store volumes only) | ✅ |
| StatusCheckFailed | Status check failures | ✅ |
| MemoryUtilization | Memory usage | ❌ Agent required |
| DiskSpaceUtilization | Disk space usage | ❌ Agent required |
Exam Tip
Exam Favorite: EC2 memory utilization and disk space are NOT included in default metrics. You must install the CloudWatch agent and collect them as custom metrics.
Basic vs Detailed Monitoring
| Item | Basic Monitoring | Detailed Monitoring |
|---|---|---|
| Collection Interval | 5 minutes | 1 minute |
| Cost | Free | Paid |
| Activation | Default | Manual |
| Use Case | General monitoring | Fast Auto Scaling response |
Custom Metrics
Send custom metrics via CloudWatch agent or API:
aws cloudwatch put-metric-data \
--namespace "MyApp" \
--metric-name "ActiveUsers" \
--value 150 \
--unit Count
→ Resolution: Standard (60 seconds) or High-resolution (1 second)
→ High-resolution metrics incur additional costs
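The same request can be assembled in Python. The following is a minimal sketch, not a definitive implementation: `build_put_metric_data_request` is a hypothetical helper name, and the dictionary keys follow the PutMetricData API as exposed by boto3 (`Namespace`, `MetricData`, `StorageResolution`). The boto3 call itself is left as a comment because it needs credentials and network access.

```python
from datetime import datetime, timezone


def build_put_metric_data_request(namespace, metric_name, value, unit="Count",
                                  high_resolution=False, timestamp=None):
    """Build keyword arguments for CloudWatch PutMetricData (hypothetical helper)."""
    datum = {
        "MetricName": metric_name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": timestamp or datetime.now(timezone.utc),
        # 1 = high-resolution (1-second) metric, 60 = standard resolution
        "StorageResolution": 1 if high_resolution else 60,
    }
    return {"Namespace": namespace, "MetricData": [datum]}


request = build_put_metric_data_request("MyApp", "ActiveUsers", 150)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**request)
```

Passing `high_resolution=True` sets `StorageResolution` to 1, which is what makes the metric high-resolution (and billable at the higher rate).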
Key Metrics by Service
| Service | Key Metrics |
|---|---|
| EC2 | CPUUtilization, StatusCheckFailed, NetworkIn/Out |
| RDS | CPUUtilization, FreeableMemory, ReadIOPS |
| ELB | RequestCount, HealthyHostCount, Latency (CLB) / TargetResponseTime (ALB) |
| Lambda | Invocations, Duration, Errors, Throttles |
| S3 | BucketSizeBytes, NumberOfObjects |
| SQS | ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage |
Alarms
Alarm States
3 States (an alarm is always in exactly one of them, and transitions can occur in any direction):
┌─────────┐ ┌─────────┐ ┌──────────────────┐
│ OK │ → │ ALARM │ → │ INSUFFICIENT_DATA │
│ (Normal)│ │ (Alert) │ │ (No data) │
└─────────┘ └─────────┘ └──────────────────┘
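To make the three states concrete, here is a deliberately simplified model of how a "greater than threshold" alarm could pick its state. It is a sketch under stated assumptions: `evaluate_alarm` is a hypothetical function, it requires every evaluated period to breach, and it treats any missing datapoint as INSUFFICIENT_DATA. Real CloudWatch alarms also support "M out of N" evaluation and configurable missing-data treatment, which this model ignores.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods=1):
    """Return 'OK', 'ALARM', or 'INSUFFICIENT_DATA' (simplified model).

    datapoints: metric values ordered oldest to newest; None marks a
    period with no data.
    """
    recent = datapoints[-evaluation_periods:]
    # Not enough periods yet, or a gap in the evaluated window
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    # Every evaluated period must breach the threshold
    if all(v > threshold for v in recent):
        return "ALARM"
    return "OK"


print(evaluate_alarm([50, 85], threshold=80))    # latest period breaches
print(evaluate_alarm([85, 50], threshold=80))    # latest period is healthy
print(evaluate_alarm([85, None], threshold=80))  # latest period has no data
```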
Alarm Actions
| Action | Description |
|---|---|
| SNS Notification | Email, SMS, Lambda trigger |
| Auto Scaling | Trigger scale out/in |
| EC2 Actions | Stop, terminate, reboot, recover instance |
Alarm Configuration Example:
Metric: CPUUtilization
Condition: Average > 80% for 5 minutes
Actions:
ALARM → SNS notification + Auto Scaling scale out
OK → SNS notification (recovery alert)
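The configuration above maps fairly directly onto the PutMetricAlarm API. The sketch below only builds the request arguments. The alarm name, the Auto Scaling group name, and both ARNs are hypothetical placeholders, and attaching the alarm to an `AutoScalingGroupName` dimension is an assumption about how the metric is scoped.

```python
def build_cpu_alarm(sns_topic_arn, scale_out_policy_arn, asg_name):
    """Keyword arguments for CloudWatch PutMetricAlarm (illustrative values)."""
    return {
        "AlarmName": "cpu-high",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 300,              # one 5-minute period
        "EvaluationPeriods": 1,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # ALARM -> SNS notification + Auto Scaling scale out
        "AlarmActions": [sns_topic_arn, scale_out_policy_arn],
        # OK -> SNS notification (recovery alert)
        "OKActions": [sns_topic_arn],
    }


alarm = build_cpu_alarm(
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder
    scale_out_policy_arn="arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:example",  # placeholder
    asg_name="web-asg",  # placeholder
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```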
Composite Alarms
Combine multiple alarms with AND, OR, and NOT:
Composite Alarm: "Service Failure"
= CPU Alarm (ALARM)
AND Error Rate Alarm (ALARM)
AND Latency Alarm (ALARM)
→ Notifies only when all 3 are in ALARM state
→ Reduces alarm noise
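A composite alarm is defined by a rule expression over its child alarms. The sketch below builds such an expression and mimics its AND/OR evaluation locally. The helper names and the three child alarm names are hypothetical; the `ALARM("name")` form is the rule syntax accepted by the PutCompositeAlarm `AlarmRule` parameter.

```python
def build_alarm_rule(alarm_names, operator="AND"):
    """Join child alarms into a composite AlarmRule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_state(child_states, operator="AND"):
    """Local model: 'ALARM' if the rule over the child states is true, else 'OK'."""
    in_alarm = [state == "ALARM" for state in child_states.values()]
    fires = all(in_alarm) if operator == "AND" else any(in_alarm)
    return "ALARM" if fires else "OK"


children = ["cpu-high", "error-rate-high", "latency-high"]  # hypothetical alarm names
rule = build_alarm_rule(children)
print(rule)

# Only two of the three children are in ALARM, so the composite stays OK
print(composite_state({"cpu-high": "ALARM",
                       "error-rate-high": "ALARM",
                       "latency-high": "OK"}))

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="service-failure", AlarmRule=rule)
```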
Logs (CloudWatch Logs)
Structure
CloudWatch Logs Hierarchy:
┌─────────────────────────┐
│ Log Group │ ← /aws/lambda/my-function
│ ┌───────────────────┐ │
│ │ Log Stream 1 │ │ ← Per instance/container
│ │ Log Stream 2 │ │
│ │ Log Stream 3 │ │
│ └───────────────────┘ │
│ Retention: 1 day to never expire │
└─────────────────────────┘
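Retention is set per log group and only accepts specific day counts. The sketch below rounds a desired retention up to the next accepted value and builds the arguments for PutRetentionPolicy. The list of accepted values is reproduced from memory of the API reference and should be treated as an assumption to verify; `nearest_allowed_retention` is a hypothetical helper.

```python
import bisect

# Assumed list of accepted retentionInDays values; verify against the API reference.
ALLOWED_RETENTION_DAYS = [1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400,
                          545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653]


def nearest_allowed_retention(min_days):
    """Smallest accepted retention that keeps logs for at least min_days."""
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, min_days)
    if index == len(ALLOWED_RETENTION_DAYS):
        raise ValueError("longer than the largest setting; leave the group at 'never expire'")
    return ALLOWED_RETENTION_DAYS[index]


policy = {
    "logGroupName": "/aws/lambda/my-function",
    "retentionInDays": nearest_allowed_retention(45),
}

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("logs").put_retention_policy(**policy)
```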
Log Sources
| Source | Configuration Method |
|---|---|
| EC2 | Install CloudWatch agent |
| Lambda | Automatic (IAM permission only) |
| ECS/Fargate | awslogs log driver |
| API Gateway | Enable in stage settings |
| CloudTrail | CloudWatch Logs integration |
| Route 53 | DNS query logging |
| VPC Flow Logs | Enable in VPC settings |
Logs Insights
CloudWatch Logs Insights Query Examples:
# Latest 20 error logs (the time range, e.g. the last hour, is set when running the query)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
# Lambda function p99 duration, per hour
filter @type = "REPORT"
| stats pct(@duration, 99) as p99 by bin(1h)
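Queries like these can also be run programmatically: a query is started, then polled until it completes. The sketch below assumes a client object with the boto3 CloudWatch Logs methods `start_query` and `get_query_results`; `run_insights_query` and its polling defaults are hypothetical. The time range is passed as epoch seconds, which is where "last hour" would be expressed.

```python
import time


def run_insights_query(logs_client, log_group, query, start_epoch, end_epoch,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it finishes.

    logs_client is expected to behave like boto3.client("logs").
    Returns each result row as a {field: value} dict.
    """
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=query,
    )
    query_id = started["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete within the polling budget")


ERROR_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| sort @timestamp desc "
    "| limit 20"
)

# Example call for the last hour (requires boto3 and credentials):
#   import boto3
#   now = int(time.time())
#   rows = run_insights_query(boto3.client("logs"), "/aws/lambda/my-function",
#                             ERROR_QUERY, now - 3600, now)
```

Taking the client as a parameter keeps the function easy to exercise with a stub instead of a live AWS account.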
Log Export
CloudWatch Logs → S3: Batch export via export task (NOT real-time ❌)
CloudWatch Logs → Kinesis Data Firehose: Real-time streaming via subscription filter ✅
CloudWatch Logs → Lambda: Real-time processing via subscription filter ✅
CloudWatch Logs → OpenSearch: Real-time analytics via subscription filter ✅
Exam Tip
Real-time Log Export: S3 export is batch (log data can take up to 12 hours to become available for export). For real-time delivery, create a subscription filter that streams to Kinesis Data Firehose.
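For the Lambda destination, a subscription filter delivers log events to the function as a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`. The following is a minimal sketch of a handler that decodes that payload and counts error lines. The "ERROR" substring check and the shape of the return value are illustrative assumptions.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs sends to a Lambda subscription."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Count log events whose message contains 'ERROR' (illustrative logic)."""
    payload = decode_subscription_event(event)
    errors = [e["message"] for e in payload["logEvents"] if "ERROR" in e["message"]]
    return {"logGroup": payload["logGroup"], "errorCount": len(errors)}


# Local round trip with a hand-built sample payload
sample = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/aws/lambda/my-function",
    "logStream": "example-stream",
    "logEvents": [
        {"id": "1", "timestamp": 1700000000000, "message": "INFO started"},
        {"id": "2", "timestamp": 1700000001000, "message": "ERROR failed to connect"},
    ],
}
encoded = base64.b64encode(gzip.compress(json.dumps(sample).encode("utf-8"))).decode("ascii")
result = handler({"awslogs": {"data": encoded}}, None)
print(result)
```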
CloudWatch Agent
When CloudWatch Agent is Needed:
- Collect EC2 memory, disk metrics
- Send EC2 internal log files to CloudWatch Logs
- Monitor on-premises servers
Installation Flow:
[EC2/On-premises] → Install CloudWatch Agent
→ IAM role (EC2) or credentials (on-premises)
→ Metrics + Logs → CloudWatch
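The agent is driven by a JSON configuration file with a `metrics` section and a `logs` section. The sketch below generates a small configuration that collects memory and root-volume disk usage and ships one log file. The log file path and log group name are hypothetical, and only a small subset of the agent's configuration options is shown.

```python
import json


def build_agent_config(log_file_path, log_group_name):
    """Minimal CloudWatch agent configuration (illustrative subset)."""
    return {
        "metrics": {
            "metrics_collected": {
                # Published as mem_used_percent
                "mem": {"measurement": ["mem_used_percent"]},
                # Published as disk_used_percent for the root volume
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file_path,
                            "log_group_name": log_group_name,
                            # One log stream per instance
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


config = build_agent_config("/var/log/myapp/app.log", "/myapp/application")  # placeholders
print(json.dumps(config, indent=2))
```

The printed JSON would be saved as the agent's configuration file on the instance or on-premises server. By default the agent publishes these metrics under the `CWAgent` namespace, which is where an alarm on memory utilization would look for them.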
CloudWatch vs CloudTrail vs AWS Config
| Item | CloudWatch | CloudTrail | AWS Config |
|---|---|---|---|
| Purpose | Performance monitoring | API audit logging | Configuration tracking |
| Question | "How's performance?" | "Who did it?" | "What changed?" |
| Data | Metrics, logs | API call records | Resource config history |
| Example | CPU > 80% alert | Track who terminated EC2 | Detect SG rule changes |
| Retention | Metrics 15 months | Events 90 days (S3 unlimited) | Config history up to 7 years (default) |
SAA-C03 Exam Focus Points
- ✅ Memory/Disk Metrics: "EC2 memory monitoring = CloudWatch agent required"
- ✅ Detailed Monitoring: "1-minute interval = enable detailed monitoring"
- ✅ Alarm Actions: "Auto scale when CPU high = CloudWatch alarm + Auto Scaling"
- ✅ Real-time Log Export: "S3 is batch, real-time = Kinesis Data Firehose"
- ✅ vs CloudTrail: "Performance = CloudWatch, Audit = CloudTrail"
Exam Tip
Sample Exam Question: "You need automatic notification when EC2 instance memory utilization exceeds 90%. What is the solution?" → Answer: Install CloudWatch agent → Memory custom metric → CloudWatch alarm → SNS notification
Frequently Asked Questions
Q: Is CloudWatch free?
Basic monitoring (5-minute intervals), 10 metric alarms, 10 custom metrics, and 3 dashboards are included in the free tier. Detailed monitoring, additional alarms and dashboards, and Logs Insights queries beyond the free allowance are charged.
Q: How long is metric data retained?
Depends on resolution. High-resolution (1 second) is retained for 3 hours, 60-second data for 15 days, 5-minute data for 63 days, and 1-hour data for 15 months.
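The retention tiers can be written as a small lookup. This is a sketch with a hypothetical function name. It takes "15 months" as 455 days, the figure AWS documents for 1-hour data, and maps periods that fall between tiers to the finer tier's retention, which is an assumption.

```python
from datetime import timedelta


def metric_retention(period_seconds):
    """Retention of a CloudWatch datapoint, based on its period."""
    if period_seconds < 60:
        return timedelta(hours=3)   # high-resolution data
    if period_seconds < 300:
        return timedelta(days=15)   # 1-minute data
    if period_seconds < 3600:
        return timedelta(days=63)   # 5-minute data
    return timedelta(days=455)      # 1-hour data (15 months)


for period in (1, 60, 300, 3600):
    print(period, metric_retention(period))
```

In practice this means older data is only available at coarser granularity: after 15 days, 1-minute datapoints can still be read, but only aggregated to 5-minute periods.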
Q: Can I monitor on-premises servers?
Yes. Install the CloudWatch agent on on-premises servers and configure IAM credentials to send metrics and logs to CloudWatch.
Q: Why is my CloudWatch alarm in INSUFFICIENT_DATA state?
This occurs when the alarm was just created, metric data is missing, or the metric namespace is incorrect. Check the missing data treatment settings.
Q: Should I store logs in CloudWatch Logs or S3?
Use CloudWatch Logs for real-time monitoring and Logs Insights queries. Use S3 for long-term retention and cost savings. Typically both are used together.