AWS Fault Injection Simulator: Testing System Resilience with Chaos Engineering

Key Takeaway

AWS Fault Injection Simulator (FIS), since renamed AWS Fault Injection Service, is a managed chaos engineering service that injects controlled faults into AWS workloads so you can test application resilience. It can simulate EC2 instance failures, AZ outages, network latency, and more.

Exam Tip

Exam Essential: "Test application resilience" → AWS FIS, "Fault injection experiments" → AWS FIS, "Game day simulations" → AWS FIS

When Should You Use AWS FIS?

Best For

AWS FIS Recommended Scenarios:
├── Resilience validation
│   └── Test if system behaves correctly during failures
├── Game Day simulations
│   └── Recreate real failure scenarios and practice response
├── CI/CD pipeline integration
│   └── Automated resilience testing before deployment
├── AZ/Region failure preparation
│   └── Validate Multi-AZ, Multi-Region architectures
└── Monitoring/alarm validation
    └── Verify CloudWatch alarms trigger correctly during failures

Not Ideal For

Cases Where AWS FIS Isn't the Best Fit:
├── First production deployment (test in staging first)
├── Systems without resilience architecture
│   → First configure Multi-AZ, Auto Scaling
├── Simple functional testing
│   → Use standard testing tools
└── Performance/load testing only
    → Use Distributed Load Testing on AWS

AWS FIS Core Concepts

Architecture

┌─────────────────────────────────────────────────────────────┐
│                AWS Fault Injection Simulator                 │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   ┌───────────────────────────────────────────────────┐     │
│   │              Experiment Template                    │     │
│   │  ┌─────────────────────────────────────────────┐  │     │
│   │  │ Actions                                      │  │     │
│   │  │  • aws:ec2:stop-instances                   │  │     │
│   │  │  • aws:ec2:terminate-instances              │  │     │
│   │  │  • aws:ecs:stop-task                        │  │     │
│   │  │  • aws:rds:failover-db-cluster              │  │     │
│   │  └─────────────────────────────────────────────┘  │     │
│   │  ┌─────────────────────────────────────────────┐  │     │
│   │  │ Targets                                      │  │     │
│   │  │  • Resource IDs                              │  │     │
│   │  │  • Tag filters                               │  │     │
│   │  │  • Resource types                            │  │     │
│   │  └─────────────────────────────────────────────┘  │     │
│   │  ┌─────────────────────────────────────────────┐  │     │
│   │  │ Stop Conditions                              │  │     │
│   │  │  • CloudWatch alarm integration              │  │     │
│   │  │  • Auto-stop on threshold breach             │  │     │
│   │  └─────────────────────────────────────────────┘  │     │
│   └───────────────────────────────────────────────────┘     │
│                           │                                  │
│                           ▼                                  │
│   ┌───────────────────────────────────────────────────┐     │
│   │              Experiment Execution                  │     │
│   │                                                    │     │
│   │  [Target Resources: EC2, ECS, RDS, EKS...]        │     │
│   │                                                    │     │
│   └───────────────────────────────────────────────────┘     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Key Terms

Term	Description
Experiment Template	Blueprint defining actions, targets, and stop conditions
Action	Fault injection operation (EC2 stop, network latency, etc.)
Target	AWS resources to inject faults into (ID, tags, filters)
Stop Condition	CloudWatch alarm that halts the experiment when it enters the ALARM state
Scenario Library	Pre-built experiment templates
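
To make the three parts of a template concrete, here is a minimal sketch in Python that builds a dictionary shaped like a FIS experiment template request. The role ARN, alarm ARN, account ID, tag values, and the names taggedInstances and stopInstances are placeholders invented for illustration; the field names follow the FIS template structure as best understood here and should be checked against the current API reference.

```python
import json


def build_stop_instances_template(role_arn, alarm_arn, tag_key, tag_value):
    """Return a dict shaped like a FIS experiment template request."""
    return {
        "description": "Stop one tagged EC2 instance for 5 minutes",
        "roleArn": role_arn,
        # Targets: which resources the experiment may touch.
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # only one matching instance
            }
        },
        # Actions: the fault to inject, pointing at a named target.
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        # Stop conditions: alarms that halt the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


# All identifiers below are hypothetical placeholders.
template = build_stop_instances_template(
    role_arn="arn:aws:iam::111122223333:role/ExampleFisRole",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:ExampleErrorRate",
    tag_key="Environment",
    tag_value="staging",
)
print(json.dumps(template, indent=2))
```

The sketch only builds and prints the structure; it makes no AWS calls. Submitting it would be a separate step through the console, CLI, or an SDK.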

Supported Actions

Major Fault Types

AWS FIS Supported Actions:
├── EC2
│   ├── aws:ec2:stop-instances
│   ├── aws:ec2:terminate-instances
│   ├── aws:ec2:reboot-instances
│   └── aws:ec2:send-spot-instance-interruptions
│
├── ECS
│   ├── aws:ecs:stop-task
│   └── aws:ecs:drain-container-instances
│
├── EKS
│   ├── aws:eks:terminate-nodegroup-instances
│   └── aws:eks:pod-delete
│
├── RDS
│   ├── aws:rds:failover-db-cluster
│   └── aws:rds:reboot-db-instances
│
├── Network
│   ├── aws:network:disrupt-connectivity
│   └── aws:ssm:send-command (inject network latency)
│
└── AZ Level
    └── aws:ec2:asg-insufficient-instance-capacity-error
        (simulate insufficient capacity in an AZ for Auto Scaling launches)

Designing FIS Experiments

Guardrails for Safe Experiments

Experiment Safety Measures:
├── Set Stop Conditions (Required!)
│   ├── Link CloudWatch alarms
│   │   └── Error rate > 5% → auto-stop experiment
│   ├── Response time thresholds
│   │   └── P99 latency > 3s → stop experiment
│   └── Service availability alarms
│
├── Limit Target Scope
│   ├── Tag filters to target test environments only
│   ├── Specify resource percentage (e.g., only 30%)
│   └── Target specific AZs only
│
└── Progressive Experiments
    ├── Phase 1: Test in development environment
    ├── Phase 2: Test in staging environment
    └── Phase 3: Limited scope in production
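
The guardrails above can also be checked mechanically before an experiment runs. The following is a rough sketch of such a pre-flight check, not a FIS feature: check_guardrails and MAX_PERCENT are hypothetical names, the 30% ceiling simply mirrors the example in the list, and the template dictionaries assume the FIS template field names (stopConditions, targets, resourceTags, resourceArns, selectionMode).

```python
import re

MAX_PERCENT = 30  # assumed policy ceiling, taken from the 30% example above


def check_guardrails(template):
    """Return a list of guardrail warnings for a FIS-style template dict."""
    warnings = []

    # Guardrail 1: at least one CloudWatch alarm stop condition.
    alarms = [
        cond for cond in template.get("stopConditions", [])
        if cond.get("source") == "aws:cloudwatch:alarm"
    ]
    if not alarms:
        warnings.append("no CloudWatch alarm stop condition")

    # Guardrail 2: every target is scoped by tags or explicit ARNs,
    # and does not select too large a share of matching resources.
    for name, target in template.get("targets", {}).items():
        if not target.get("resourceTags") and not target.get("resourceArns"):
            warnings.append(f"{name}: no tag filter or explicit resource list")
        mode = target.get("selectionMode", "ALL")
        percent = re.fullmatch(r"PERCENT\((\d+)\)", mode)
        if mode == "ALL" or (percent and int(percent.group(1)) > MAX_PERCENT):
            warnings.append(f"{name}: selection mode {mode} is too broad")

    return warnings


risky = {
    "stopConditions": [{"source": "none"}],
    "targets": {"fleet": {"resourceType": "aws:ec2:instance", "selectionMode": "ALL"}},
}
safe = {
    "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": "example-alarm-arn"}],
    "targets": {
        "sample": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Environment": "staging"},
            "selectionMode": "PERCENT(30)",
        }
    },
}
print(check_guardrails(risky))
print(check_guardrails(safe))
```

Treating ALL as too broad is a deliberately strict policy choice for the sketch; a team that targets a short explicit list of ARNs might reasonably relax it.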

Exam Tip

Exam Point: Every FIS experiment template declares stop conditions. The API accepts a source of "none", but a real experiment should link CloudWatch alarms so that FIS stops it automatically when an alarm threshold is breached.

Common Experiment Scenarios

Scenario	Action	Validation Point
EC2 Instance Failure	aws:ec2:stop-instances	Does Auto Scaling launch replacement instances?
AZ Failure	Stop all instances in a specific AZ	Multi-AZ failover behavior
RDS Failover	aws:rds:failover-db-cluster	Does the application reconnect automatically?
ECS Task Failure	aws:ecs:stop-task	Does the service restart tasks?
Network Latency	aws:ssm:send-command (runs tc)	Timeout/retry logic behavior
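
For the AZ failure row, the target definition is what narrows "all instances" to a single Availability Zone. A minimal sketch of such a target follows, assuming the FIS target filter syntax with the attribute paths Placement.AvailabilityZone and State.Name; the function name, tag, and AZ values are placeholders for illustration.

```python
def az_instance_target(availability_zone, tag_key, tag_value):
    """Build a FIS-style target selecting running, tagged instances in one AZ."""
    return {
        "resourceType": "aws:ec2:instance",
        # Tag filter keeps the blast radius inside one environment.
        "resourceTags": {tag_key: tag_value},
        "filters": [
            # Only instances placed in the chosen AZ.
            {"path": "Placement.AvailabilityZone", "values": [availability_zone]},
            # Skip instances that are already stopped or terminating.
            {"path": "State.Name", "values": ["running"]},
        ],
        # Every instance that matches the tag and both filters.
        "selectionMode": "ALL",
    }


# Hypothetical values: staging instances in us-east-1a.
target = az_instance_target("us-east-1a", "Environment", "staging")
for flt in target["filters"]:
    print(flt["path"], "=", flt["values"])
```

Paired with an aws:ec2:stop-instances action, a target like this approximates losing the compute in one AZ. It does not reproduce network or managed-service impairment in that AZ.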

CI/CD Pipeline Integration

Automated Resilience Testing

┌─────────────────────────────────────────────────────────────┐
│              CI/CD + FIS Integration Architecture            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   [Code Commit] → [Build] → [Deploy to Staging]             │
│                                    │                         │
│                                    ▼                         │
│                        ┌─────────────────┐                  │
│                        │   AWS FIS       │                  │
│                        │   Run Experiment│                  │
│                        └─────────────────┘                  │
│                                    │                         │
│                          ┌────────┴────────┐                │
│                          ▼                 ▼                │
│                      [Pass]            [Fail]               │
│                          │                 │                │
│                          ▼                 ▼                │
│               [Deploy to Prod]     [Rollback + Alert]       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
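
The pass/fail branch in the diagram comes down to waiting for the experiment to reach a terminal state and mapping that state to a pipeline result. The sketch below shows one way to do this. wait_for_experiment and gate_passes are hypothetical helper names, and the status source is injected as a plain callable so the logic can run without AWS access; in a real pipeline that callable would probably wrap an SDK call such as boto3's get_experiment and read experiment.state.status from the response.

```python
import time

# Experiment states treated as terminal in this sketch.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def wait_for_experiment(get_status, poll_seconds=15, timeout_seconds=1800,
                        sleep=time.sleep):
    """Poll get_status() until a terminal state or the timeout is reached."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            return "timeout"
        sleep(poll_seconds)
        waited += poll_seconds


def gate_passes(final_status):
    """Only a cleanly completed experiment lets the deployment continue.

    "stopped" usually means a stop condition fired or someone halted the
    run, so it is treated as a failure here, as are "failed" and "timeout".
    """
    return final_status == "completed"


# Simulated status sequence standing in for real API responses.
fake_statuses = iter(["initiating", "running", "running", "completed"])
final = wait_for_experiment(lambda: next(fake_statuses), sleep=lambda s: None)
print(final, "->", "deploy to prod" if gate_passes(final) else "rollback + alert")
```

Treating "stopped" as a failure is the conservative choice: if an alarm halted the experiment, the system did not stay within its thresholds under fault.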

Pricing Structure

Pricing (US East)

Item	Cost
Action-minute	$0.10/action-minute

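Example: An experiment with 5 actions, each active for 10 minutes

5 actions × 10 minutes = 50 action-minutes
Cost: 50 × $0.10 = $5.00

Billing is per action, not per targeted resource: a single stop-instances action that targets 5 EC2 instances for 10 minutes counts as 10 action-minutes ($1.00).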
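
The action-minute arithmetic can be sketched as a small helper. estimate_cost is a hypothetical name, and the $0.10 rate is the US East figure from the table above; other regions or multi-account experiments may be priced differently.

```python
from decimal import Decimal

# US East rate from the pricing table above.
RATE_PER_ACTION_MINUTE = Decimal("0.10")


def estimate_cost(action_minutes, rate=RATE_PER_ACTION_MINUTE):
    """Estimate cost from a list holding the active minutes of each action."""
    total_minutes = sum(action_minutes)
    return total_minutes * rate


# Five actions, each active for 10 minutes: 50 action-minutes.
print(estimate_cost([10, 10, 10, 10, 10]))  # prints 5.00
```

Decimal is used instead of float so the result stays an exact currency amount.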

Cost Optimization Tips

Cost Reduction Strategies:
├── Short experiment duration
│   └── Run only minimum time needed for validation
├── Minimize target resources
│   └── Sample resources, not entire fleet
├── Use dev/test environments
│   └── Validate in non-production before production
└── Leverage Scenario Library
    └── Pre-built templates reduce trial and error

SAA-C03 Exam Focus Points

Commonly Tested Scenarios

✅ Resilience Testing: "Validate system behavior during failures" → AWS FIS
✅ Game Day: "Recreate real failure scenarios" → AWS FIS
✅ AZ Failure Simulation: "Verify behavior during specific AZ failure" → AWS FIS
✅ Automated Testing: "Resilience testing in CI/CD" → AWS FIS
✅ Safety Measures: "Stop conditions, CloudWatch alarm integration" → FIS Stop Conditions

Sample Exam Questions

Exam Tip

Sample Exam Question 1: "How can you verify that a Multi-AZ deployed application actually works correctly during an AZ failure?"

→ Answer: AWS Fault Injection Simulator (simulate AZ failure, verify failover behavior)

Exam Tip

Sample Exam Question 2: "How can you automatically stop an FIS experiment if application error rate spikes?"

→ Answer: Link CloudWatch alarm as FIS experiment Stop Condition

Exam Tip

Sample Exam Question 3: "What measures are needed to safely run chaos engineering experiments in production?"

→ Answer: Configure Stop Conditions + Limit target resource scope + Progressive experiments
