AWS Fault Injection Simulator: Testing System Resilience with Chaos Engineering
How to run fault injection experiments with AWS FIS. EC2, ECS, RDS failure simulations and SAA-C03 exam essentials explained.
Related Exam Domains
- Domain 2: Design Resilient Architectures
Key Takeaway
AWS Fault Injection Simulator (FIS), since renamed AWS Fault Injection Service, is a managed chaos engineering service that injects controlled faults into AWS workloads so you can test application resilience. You can simulate EC2 instance failures, AZ outages, network latency, and more.
Exam Tip
Exam Essential: "Test application resilience" → AWS FIS, "Fault injection experiments" → AWS FIS, "Game day simulations" → AWS FIS
When Should You Use AWS FIS?
Best For
AWS FIS Recommended Scenarios:
├── Resilience validation
│ └── Test if system behaves correctly during failures
├── Game Day simulations
│ └── Recreate real failure scenarios and practice response
├── CI/CD pipeline integration
│ └── Automated resilience testing before deployment
├── AZ/Region failure preparation
│ └── Validate Multi-AZ, Multi-Region architectures
└── Monitoring/alarm validation
  └── Verify CloudWatch alarms trigger correctly during failures
Not Ideal For
Cases Where AWS FIS Isn't the Best Fit:
├── First production deployment (test in staging first)
├── Systems without resilience architecture
│ → First configure Multi-AZ, Auto Scaling
├── Simple functional testing
│ → Use standard testing tools
└── Performance/load testing only
  → Use Distributed Load Testing on AWS
AWS FIS Core Concepts
Architecture
┌─────────────────────────────────────────────────────────────┐
│ AWS Fault Injection Simulator │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Experiment Template │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Actions │ │ │
│ │ │ • aws:ec2:stop-instances │ │ │
│ │ │ • aws:ec2:terminate-instances │ │ │
│ │ │ • aws:ecs:stop-task │ │ │
│ │ │ • aws:rds:failover-db-cluster │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Targets │ │ │
│ │ │ • Resource IDs │ │ │
│ │ │ • Tag filters │ │ │
│ │ │ • Resource types │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Stop Conditions │ │ │
│ │ │ • CloudWatch alarm integration │ │ │
│ │ │ • Auto-stop on threshold breach │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Experiment Execution │ │
│ │ │ │
│ │ [Target Resources: EC2, ECS, RDS, EKS...] │ │
│ │ │ │
│ └───────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Key Terms
| Term | Description |
|---|---|
| Experiment Template | Blueprint defining actions, targets, and stop conditions |
| Action | Fault injection operation (EC2 stop, network latency, etc.) |
| Target | AWS resources to inject faults into (ID, tags, filters) |
| Stop Condition | CloudWatch alarm that stops the experiment automatically when it goes into the ALARM state |
| Scenario Library | Pre-built experiment templates |
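To make the terms above concrete, here is a minimal sketch of an experiment template assembled as a plain Python dictionary. The field names (`targets`, `actions`, `stopConditions`, `roleArn`, `selectionMode`) follow the FIS template format as commonly documented; the helper name `build_stop_instances_template`, the `env=staging` tag, and both ARNs are placeholders assumed for illustration.

```python
import json


def build_stop_instances_template(role_arn: str, alarm_arn: str) -> dict:
    """Assemble a template that stops one tagged EC2 instance for 5 minutes."""
    return {
        "description": "Stop one staging instance and watch recovery",
        # Target: which resources the experiment is allowed to touch.
        "targets": {
            "stagingInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"env": "staging"},  # hypothetical tag
                "selectionMode": "COUNT(1)",
            }
        },
        # Action: the fault to inject, bound to the named target above.
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration; instances are restarted afterwards.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "stagingInstances"},
            }
        },
        # Stop condition: a CloudWatch alarm that halts the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "roleArn": role_arn,
    }


template = build_stop_instances_template(
    role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",  # placeholder
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:HighErrorRate",  # placeholder
)
print(json.dumps(template, indent=2))
```

A dictionary shaped like this could likely be passed as keyword arguments to the boto3 `fis` client's `create_experiment_template` call, assuming the IAM role grants FIS the permissions the action needs.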
Supported Actions
Major Fault Types
AWS FIS Supported Actions:
├── EC2
│ ├── aws:ec2:stop-instances
│ ├── aws:ec2:terminate-instances
│ ├── aws:ec2:reboot-instances
│ └── aws:ec2:send-spot-instance-interruptions
│
├── ECS
│ ├── aws:ecs:stop-task
│ └── aws:ecs:drain-container-instances
│
├── EKS
│ ├── aws:eks:terminate-nodegroup-instances
│ └── aws:eks:pod-delete
│
├── RDS
│ ├── aws:rds:failover-db-cluster
│ └── aws:rds:reboot-db-instances
│
├── Network
│ ├── aws:network:disrupt-connectivity
│ └── aws:ssm:send-command (inject network latency)
│
└── AZ Level
  └── aws:ec2:asg-insufficient-instance-capacity-error
    (simulate insufficient-capacity errors for Auto Scaling groups in an AZ)
Designing FIS Experiments
Guardrails for Safe Experiments
Experiment Safety Measures:
├── Set Stop Conditions (Required!)
│ ├── Link CloudWatch alarms
│ │ └── Error rate > 5% → auto-stop experiment
│ ├── Response time thresholds
│ │ └── P99 latency > 3s → stop experiment
│ └── Service availability alarms
│
├── Limit Target Scope
│ ├── Tag filters to target test environments only
│ ├── Specify resource percentage (e.g., only 30%)
│ └── Target specific AZs only
│
└── Progressive Experiments
  ├── Phase 1: Test in development environment
  ├── Phase 2: Test in staging environment
  └── Phase 3: Limited scope in production
Exam Tip
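Exam Point: Every FIS experiment template includes a stop conditions setting. It can technically be set to none, but the safe practice (and the expected exam answer) is to link CloudWatch alarms so the experiment stops automatically when thresholds are breached.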
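The guardrails above can be turned into a small pre-flight check that a team might run before starting any experiment. This is a sketch under stated assumptions: `guardrail_problems` and `MAX_PERCENT` are hypothetical names, the 30% ceiling is taken from the example above, and the template is assumed to be a dictionary using the `stopConditions`, `targets`, `resourceTags`, `resourceArns`, and `selectionMode` fields of the FIS template format.

```python
import re
from typing import List

MAX_PERCENT = 30  # ceiling borrowed from the "only 30%" example above


def guardrail_problems(template: dict) -> List[str]:
    """Return a list of guardrail violations; an empty list means none found."""
    problems = []

    # Guardrail 1: at least one CloudWatch alarm must be able to stop the run.
    alarms = [
        c for c in template.get("stopConditions", [])
        if c.get("source") == "aws:cloudwatch:alarm"
    ]
    if not alarms:
        problems.append("no CloudWatch alarm stop condition")

    for name, target in template.get("targets", {}).items():
        # Guardrail 2: scope targets by tag or by explicit ARN.
        if not target.get("resourceTags") and not target.get("resourceArns"):
            problems.append(f"target {name!r} is not scoped by tags or ARNs")

        # Guardrail 3: never select everything, and cap PERCENT(n).
        mode = target.get("selectionMode", "ALL")
        percent = re.fullmatch(r"PERCENT\((\d+)\)", mode)
        if mode == "ALL" or (percent and int(percent.group(1)) > MAX_PERCENT):
            problems.append(f"target {name!r} selects too much ({mode})")

    return problems


safe = {
    "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": "arn:placeholder"}],
    "targets": {"web": {"resourceTags": {"env": "staging"}, "selectionMode": "PERCENT(30)"}},
}
unsafe = {
    "stopConditions": [{"source": "none"}],
    "targets": {"web": {"selectionMode": "ALL"}},
}
print(guardrail_problems(safe))
print(guardrail_problems(unsafe))
```

Running the check in the same pipeline step that starts the experiment keeps an unguarded template from ever reaching production.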
Common Experiment Scenarios
| Scenario | Action | Validation Point |
|---|---|---|
| EC2 Instance Failure | stop-instances | Does Auto Scaling launch new instances? |
| AZ Failure | Stop all instances in specific AZ | Multi-AZ failover behavior |
| RDS Failover | failover-db-cluster | Application auto-reconnection |
| ECS Task Failure | stop-task | Does service restart tasks? |
| Network Latency | SSM tc command | Timeout/retry logic behavior |
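Three of the rows above map directly onto a single FIS action and target type, which the following sketch captures as a lookup table. The scenario keys and the `scenario_fragment` helper are hypothetical; the action IDs come from the table, and the resource types and target keys (`Instances`, `Clusters`, `Tasks`) are what the FIS action reference is understood to use. The AZ failure and network latency rows need extra parameters and are left out.

```python
# scenario name -> (FIS action ID, target resource type, action's target key)
SCENARIOS = {
    "ec2-instance-failure": ("aws:ec2:stop-instances", "aws:ec2:instance", "Instances"),
    "rds-failover": ("aws:rds:failover-db-cluster", "aws:rds:cluster", "Clusters"),
    "ecs-task-failure": ("aws:ecs:stop-task", "aws:ecs:task", "Tasks"),
}


def scenario_fragment(scenario: str, tags: dict, selection_mode: str = "COUNT(1)") -> dict:
    """Build the targets/actions portion of a template for one scenario."""
    action_id, resource_type, target_key = SCENARIOS[scenario]
    target_name = f"{scenario}-targets"
    return {
        "targets": {
            target_name: {
                "resourceType": resource_type,
                "resourceTags": tags,
                "selectionMode": selection_mode,
            }
        },
        "actions": {
            scenario: {
                "actionId": action_id,
                # The key names the kind of target; the value names our target.
                "targets": {target_key: target_name},
            }
        },
    }


fragment = scenario_fragment("rds-failover", {"env": "staging"})  # hypothetical tag
print(fragment["actions"]["rds-failover"]["actionId"])
```

The fragment still needs `stopConditions`, `roleArn`, and a `description` before it is a complete template.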
CI/CD Pipeline Integration
Automated Resilience Testing
┌─────────────────────────────────────────────────────────────┐
│ CI/CD + FIS Integration Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Code Commit] → [Build] → [Deploy to Staging] │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ AWS FIS │ │
│ │ Run Experiment│ │
│ └─────────────────┘ │
│ │ │
│ ┌────────┴────────┐ │
│ ▼ ▼ │
│ [Pass] [Fail] │
│ │ │ │
│ ▼ ▼ │
│ [Deploy to Prod] [Rollback + Alert] │
│ │
└─────────────────────────────────────────────────────────────┘
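The pass/fail branch in the diagram can be sketched as a polling loop plus a gate decision. The status strings (`completed`, `stopped`, `failed`) are the experiment states this sketch assumes FIS reports; `wait_for_experiment` and `gate` are hypothetical names, and the status source is injected as a callable so the logic can be exercised without AWS credentials.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "stopped", "failed"}  # assumed terminal states


def wait_for_experiment(
    get_status: Callable[[], str],
    poll_seconds: float = 15.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the experiment reaches a terminal state or time runs out."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"experiment still {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


def gate(status: str) -> str:
    """Only a cleanly completed experiment lets the deployment continue."""
    # "stopped" usually means a stop condition fired or someone intervened.
    return "deploy-to-prod" if status == "completed" else "rollback-and-alert"


# Demo with a fake status feed instead of a real FIS call.
fake_feed = iter(["initiating", "running", "running", "completed"])
final_status = wait_for_experiment(lambda: next(fake_feed), sleep=lambda s: None)
print(final_status, "->", gate(final_status))
```

In a real pipeline, `get_status` would probably wrap the boto3 `fis` client's `get_experiment` call and read the status from the returned experiment state. Treating `stopped` as a failure is a design choice: it means a tripped alarm blocks the release rather than being silently ignored.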
Pricing Structure
Pricing (US East)
| Item | Cost |
|---|---|
| Action-minute | $0.10/action-minute |
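Example: An experiment that runs 5 actions for 10 minutes each
- 5 actions × 10 minutes = 50 action-minutes
- Cost: 50 × $0.10 = $5.00
Billing is per action, not per target resource: a single stop-instances action that targets 5 instances for 10 minutes counts as 10 action-minutes.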
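The arithmetic above is simple enough to script when estimating a whole test plan. This is a minimal sketch: the function names are hypothetical, the rate is the US East figure quoted in the table, and `Decimal` is used so that currency values do not pick up floating-point rounding.

```python
from decimal import Decimal

RATE_PER_ACTION_MINUTE = Decimal("0.10")  # US East price from the table above


def action_minutes(actions: int, minutes_each: int) -> int:
    """Total billable action-minutes for actions of equal duration."""
    return actions * minutes_each


def fis_cost(total_action_minutes: int, rate: Decimal = RATE_PER_ACTION_MINUTE) -> Decimal:
    """Cost in USD for the given number of action-minutes."""
    return total_action_minutes * rate


minutes = action_minutes(actions=5, minutes_each=10)
print(minutes, "action-minutes cost $", fis_cost(minutes))
```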
Cost Optimization Tips
Cost Reduction Strategies:
├── Short experiment duration
│ └── Run only minimum time needed for validation
├── Minimize target resources
│ └── Sample resources, not entire fleet
├── Use dev/test environments
│ └── Validate in non-production before production
└── Leverage Scenario Library
  └── Pre-built templates reduce trial and error
SAA-C03 Exam Focus Points
Commonly Tested Scenarios
- ✅ Resilience Testing: "Validate system behavior during failures" → AWS FIS
- ✅ Game Day: "Recreate real failure scenarios" → AWS FIS
- ✅ AZ Failure Simulation: "Verify behavior during specific AZ failure" → AWS FIS
- ✅ Automated Testing: "Resilience testing in CI/CD" → AWS FIS
- ✅ Safety Measures: "Stop conditions, CloudWatch alarm integration" → FIS Stop Conditions
Sample Exam Questions
Exam Tip
Sample Exam Question 1: "How can you verify that a Multi-AZ deployed application actually works correctly during an AZ failure?"
→ Answer: AWS Fault Injection Simulator (simulate AZ failure, verify failover behavior)
Exam Tip
Sample Exam Question 2: "How can you automatically stop an FIS experiment if application error rate spikes?"
→ Answer: Link CloudWatch alarm as FIS experiment Stop Condition
Exam Tip
Sample Exam Question 3: "What measures are needed to safely run chaos engineering experiments in production?"
→ Answer: Configure Stop Conditions + Limit target resource scope + Progressive experiments
Frequently Asked Questions
Q: What's the difference between FIS and third-party chaos engineering tools?
AWS FIS integrates natively with AWS services, which simplifies IAM permissions, CloudWatch alarm stop conditions, and tag-based target filtering. Third-party tools such as Chaos Monkey and Gremlin can cover multi-cloud environments, but FIS is more tightly integrated with AWS.
Q: Does FIS actually terminate resources?
Yes. FIS acts on real resources: the EC2 stop action really stops instances, and the terminate action really terminates them. Always test in non-production environments first, and keep the scope limited when you run in production.
Q: What if something goes wrong during an experiment?
If a Stop Condition triggers, FIS automatically stops the experiment. You can also stop it manually from the console or CLI. Stopping an experiment does not bring back instances that were already terminated; those must be replaced, for example by Auto Scaling.
Q: Which regions support FIS?
FIS is available in most AWS commercial regions. Check AWS documentation for new region availability.
Q: Can FIS corrupt data?
FIS itself doesn't directly corrupt data. However, if an EBS volume is set to be deleted when its instance terminates (the DeleteOnTermination attribute), a terminate action can cause data loss. Verify data backups and EBS settings before running experiments.
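One way to act on that advice is to scan instance metadata for volumes that would disappear with the instance. The sketch below walks a dictionary shaped like an EC2 `describe_instances` response (`Reservations`, `Instances`, `BlockDeviceMappings`, `Ebs.DeleteOnTermination`); the function name and the sample IDs are hypothetical, and the sample response is hand-written rather than real output.

```python
from typing import List, Tuple


def volumes_deleted_on_termination(response: dict) -> List[Tuple[str, str, str]]:
    """List (instance ID, device name, volume ID) for volumes at risk."""
    at_risk = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            for mapping in instance.get("BlockDeviceMappings", []):
                ebs = mapping.get("Ebs", {})
                # True means the volume is deleted when the instance terminates.
                if ebs.get("DeleteOnTermination"):
                    at_risk.append(
                        (instance["InstanceId"], mapping.get("DeviceName", ""), ebs.get("VolumeId", ""))
                    )
    return at_risk


# Hand-written sample in the shape of a describe_instances response.
sample = {
    "Reservations": [{
        "Instances": [{
            "InstanceId": "i-0123456789abcdef0",
            "BlockDeviceMappings": [
                {"DeviceName": "/dev/xvda", "Ebs": {"VolumeId": "vol-0aaa", "DeleteOnTermination": True}},
                {"DeviceName": "/dev/sdf", "Ebs": {"VolumeId": "vol-0bbb", "DeleteOnTermination": False}},
            ],
        }]
    }]
}
for instance_id, device, volume_id in volumes_deleted_on_termination(sample):
    print(f"{instance_id}: {device} ({volume_id}) is deleted on termination")
```

Volumes flagged here are candidates for a snapshot before any experiment that uses a terminate action.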