SAABlog
Intermediate

AWS Fault Injection Simulator: Testing System Resilience with Chaos Engineering

How to run fault injection experiments with AWS FIS, covering EC2, ECS, and RDS failure simulations and the SAA-C03 exam essentials.

PHILOLAMB-
Tags: Fault Injection Simulator, Chaos Engineering, Resilience, Fault Testing, FIS

Related Exam Domains

  • Domain 2: Design Resilient Architectures

Key Takeaway

AWS Fault Injection Simulator (FIS), since renamed AWS Fault Injection Service, is a managed chaos engineering service that injects controlled faults into AWS workloads to test application resilience. You can simulate EC2 failures, AZ outages, network latency, and more.

Exam Tip

Exam Essential: "Test application resilience" → AWS FIS, "Fault injection experiments" → AWS FIS, "Game day simulations" → AWS FIS


When Should You Use AWS FIS?

Best For

AWS FIS Recommended Scenarios:
├── Resilience validation
│   └── Test if system behaves correctly during failures
├── Game Day simulations
│   └── Recreate real failure scenarios and practice response
├── CI/CD pipeline integration
│   └── Automated resilience testing before deployment
├── AZ/Region failure preparation
│   └── Validate Multi-AZ, Multi-Region architectures
└── Monitoring/alarm validation
    └── Verify CloudWatch alarms trigger correctly during failures

Not Ideal For

Cases Where AWS FIS Isn't the Best Fit:
├── First production deployment (test in staging first)
├── Systems without resilience architecture
│   → First configure Multi-AZ, Auto Scaling
├── Simple functional testing
│   → Use standard testing tools
└── Performance/load testing only
    → Use Distributed Load Testing on AWS

AWS FIS Core Concepts

Architecture

┌─────────────────────────────────────────────────────────────┐
│                AWS Fault Injection Simulator                 │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   ┌───────────────────────────────────────────────────┐     │
│   │              Experiment Template                    │     │
│   │  ┌─────────────────────────────────────────────┐  │     │
│   │  │ Actions                                      │  │     │
│   │  │  • aws:ec2:stop-instances                   │  │     │
│   │  │  • aws:ec2:terminate-instances              │  │     │
│   │  │  • aws:ecs:stop-task                        │  │     │
│   │  │  • aws:rds:failover-db-cluster              │  │     │
│   │  └─────────────────────────────────────────────┘  │     │
│   │  ┌─────────────────────────────────────────────┐  │     │
│   │  │ Targets                                      │  │     │
│   │  │  • Resource IDs                              │  │     │
│   │  │  • Tag filters                               │  │     │
│   │  │  • Resource types                            │  │     │
│   │  └─────────────────────────────────────────────┘  │     │
│   │  ┌─────────────────────────────────────────────┐  │     │
│   │  │ Stop Conditions                              │  │     │
│   │  │  • CloudWatch alarm integration              │  │     │
│   │  │  • Auto-stop on threshold breach             │  │     │
│   │  └─────────────────────────────────────────────┘  │     │
│   └───────────────────────────────────────────────────┘     │
│                           │                                  │
│                           ▼                                  │
│   ┌───────────────────────────────────────────────────┐     │
│   │              Experiment Execution                  │     │
│   │                                                    │     │
│   │  [Target Resources: EC2, ECS, RDS, EKS...]        │     │
│   │                                                    │     │
│   └───────────────────────────────────────────────────┘     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Key Terms

Term                 │ Description
Experiment Template  │ Blueprint defining actions, targets, and stop conditions
Action               │ Fault injection operation (EC2 stop, network latency, etc.)
Target               │ AWS resources to inject faults into (ID, tags, filters)
Stop Condition       │ Auto-stop conditions (CloudWatch alarm integration)
Scenario Library     │ Pre-built experiment templates
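
To make the terms above concrete, here is a minimal sketch of an experiment template expressed as a Python dictionary. The field names follow the FIS CreateExperimentTemplate request shape as I understand it; the role ARN, alarm ARN, target name `app-instances`, and the `Environment: staging` tag are placeholders invented for illustration.

```python
import json


def build_stop_instances_template(role_arn: str, alarm_arn: str) -> dict:
    """Build a request body for a template that stops one tagged EC2 instance."""
    return {
        "description": "Stop one tagged EC2 instance to check Auto Scaling recovery",
        "roleArn": role_arn,  # IAM role FIS assumes to run the actions
        # Target: which resources the faults are injected into.
        "targets": {
            "app-instances": {  # hypothetical target name
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"Environment": "staging"},  # hypothetical tag
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        # Action: the fault to inject, wired to the target by name.
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "app-instances"},
            }
        },
        # Stop Condition: a CloudWatch alarm that halts the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


if __name__ == "__main__":
    template = build_stop_instances_template(
        "arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-error-rate",  # placeholder
    )
    print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could presumably be submitted with `boto3.client("fis").create_experiment_template(**template)`; that step is left out here so the sketch runs without an AWS account.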

Supported Actions

Major Fault Types

AWS FIS Supported Actions:
├── EC2
│   ├── aws:ec2:stop-instances
│   ├── aws:ec2:terminate-instances
│   ├── aws:ec2:reboot-instances
│   └── aws:ec2:send-spot-instance-interruptions
│
├── ECS
│   ├── aws:ecs:stop-task
│   └── aws:ecs:drain-container-instances
│
├── EKS
│   ├── aws:eks:terminate-nodegroup-instances
│   └── aws:eks:pod-delete
│
├── RDS
│   ├── aws:rds:failover-db-cluster
│   └── aws:rds:reboot-db-instances
│
├── Network
│   ├── aws:network:disrupt-connectivity
│   └── aws:ssm:send-command (inject network latency)
│
└── AZ Level
    └── aws:ec2:asg-insufficient-instance-capacity-error
        (simulate AZ capacity issues)

Designing FIS Experiments

Guardrails for Safe Experiments

Experiment Safety Measures:
├── Set Stop Conditions (Required!)
│   ├── Link CloudWatch alarms
│   │   └── Error rate > 5% → auto-stop experiment
│   ├── Response time thresholds
│   │   └── P99 latency > 3s → stop experiment
│   └── Service availability alarms
│
├── Limit Target Scope
│   ├── Tag filters to target test environments only
│   ├── Specify resource percentage (e.g., only 30%)
│   └── Target specific AZs only
│
└── Progressive Experiments
    ├── Phase 1: Test in development environment
    ├── Phase 2: Test in staging environment
    └── Phase 3: Limited scope in production
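
The guardrails above can be checked mechanically before an experiment is ever started. The following is a rough sketch of such a pre-flight check, not an FIS feature: the function name `guardrail_violations` and the convention of requiring an `Environment` tag are assumptions made for this example. It inspects a template dictionary in the CreateExperimentTemplate shape.

```python
from __future__ import annotations


def guardrail_violations(template: dict, required_tag: str = "Environment") -> list[str]:
    """Return a list of guardrail problems found in an experiment template."""
    problems = []

    # Guardrail 1: at least one stop condition must be a CloudWatch alarm.
    sources = {cond.get("source") for cond in template.get("stopConditions", [])}
    if "aws:cloudwatch:alarm" not in sources:
        problems.append("no CloudWatch alarm stop condition")

    # Guardrail 2: every target must be limited in scope.
    for name, target in template.get("targets", {}).items():
        # Treat a missing selection mode as ALL to stay on the safe side.
        if target.get("selectionMode", "ALL") == "ALL":
            problems.append(f"target {name!r} selects ALL matching resources")
        scoped_by_arn = bool(target.get("resourceArns"))
        scoped_by_tag = required_tag in target.get("resourceTags", {})
        if not (scoped_by_arn or scoped_by_tag):
            problems.append(f"target {name!r} is not scoped by the {required_tag!r} tag")

    return problems
```

A pipeline step could refuse to create or start the experiment whenever this list is non-empty, which keeps an unscoped template from reaching production by accident.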

Exam Tip

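Exam Point: Every FIS experiment template must declare Stop Conditions. Link CloudWatch alarms so that FIS stops the experiment automatically when an alarm threshold is breached. The template also accepts a source of "none", but that leaves the experiment without an automatic stop.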

Common Experiment Scenarios

Scenario              │ Action                             │ Validation Point
EC2 Instance Failure  │ stop-instances                     │ Does Auto Scaling launch new instances?
AZ Failure            │ Stop all instances in specific AZ  │ Multi-AZ failover behavior
RDS Failover          │ failover-db-cluster                │ Application auto-reconnection
ECS Task Failure      │ stop-task                          │ Does service restart tasks?
Network Latency       │ SSM tc command                     │ Timeout/retry logic behavior
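
The network latency row is the least obvious of these, because the fault is delivered through an SSM document rather than a dedicated action. Below is a hedged sketch of what that action entry might look like. It assumes the AWS-provided `AWSFIS-Run-Network-Latency` document and its `DurationSeconds`, `DelayMilliseconds`, `Interface`, and `InstallDependencies` parameters; verify those names against the current SSM document before relying on them. The target name passed in is a placeholder.

```python
import json


def network_latency_action(region: str, target_name: str,
                           delay_ms: int = 200, minutes: int = 5) -> dict:
    """Build an aws:ssm:send-command action that adds latency via tc."""
    # Parameters handed to the SSM document (assumed names, see lead-in).
    document_parameters = {
        "DurationSeconds": str(minutes * 60),
        "DelayMilliseconds": str(delay_ms),
        "Interface": "eth0",
        "InstallDependencies": "True",
    }
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-Network-Latency",
            # FIS expects the document parameters as a JSON string.
            "documentParameters": json.dumps(document_parameters),
            "duration": f"PT{minutes}M",  # ISO 8601 duration for the action
        },
        "targets": {"Instances": target_name},
    }
```

The returned dictionary would slot into the `actions` map of an experiment template, with `target_name` matching a key in the template's `targets` map.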

CI/CD Pipeline Integration

Automated Resilience Testing

┌─────────────────────────────────────────────────────────────┐
│              CI/CD + FIS Integration Architecture            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   [Code Commit] → [Build] → [Deploy to Staging]             │
│                                    │                         │
│                                    ▼                         │
│                        ┌─────────────────┐                  │
│                        │   AWS FIS       │                  │
│                        │   Run Experiment│                  │
│                        └─────────────────┘                  │
│                                    │                         │
│                          ┌────────┴────────┐                │
│                          ▼                 ▼                │
│                      [Pass]            [Fail]               │
│                          │                 │                │
│                          ▼                 ▼                │
│               [Deploy to Prod]     [Rollback + Alert]       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
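
The Pass/Fail branch in the diagram comes down to polling the experiment until it finishes and treating anything other than a clean completion as a failure. Here is a minimal sketch of that gate. The status strings are the FIS experiment states as I understand them; `wait_for_experiment` and its parameters are names invented for this example, and the status lookup is injected as a callable so the logic runs without AWS access.

```python
import time
from typing import Callable

# States after which the experiment will not change any further.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(get_status: Callable[[], str],
                        poll_seconds: float = 15.0,
                        timeout_seconds: float = 1800.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until a terminal state is reached; True only for 'completed'."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            # 'stopped' means a stop condition or a person halted the run,
            # so the pipeline should not promote the build.
            return status == "completed"
        if waited >= timeout_seconds:
            return False  # treat a hung experiment as a failed gate
        sleep(poll_seconds)
        waited += poll_seconds
```

In a real pipeline, `get_status` could wrap something like `boto3.client("fis").get_experiment(id=experiment_id)["experiment"]["state"]["status"]`, and a `False` result would trigger the rollback and alert path.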

Pricing Structure

Pricing (US East)

Item           │ Cost
Action-minute  │ $0.10/action-minute

Example: Five actions, each running for 10 minutes (billing counts actions, not the number of targeted instances)

  • 5 actions × 10 minutes = 50 action-minutes
  • Cost: 50 × $0.10 = $5.00
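
The same arithmetic as a small helper, useful for estimating a planned experiment. It assumes the $0.10 US East price quoted above, which may change, and uses `Decimal` to avoid floating-point rounding in currency values. The function name is made up for this sketch.

```python
from decimal import Decimal

# US East price quoted in this article; check current AWS pricing before use.
PRICE_PER_ACTION_MINUTE = Decimal("0.10")


def experiment_cost(actions: int, minutes_per_action: int,
                    price: Decimal = PRICE_PER_ACTION_MINUTE) -> Decimal:
    """Cost = number of actions x minutes each action runs x price per action-minute."""
    action_minutes = actions * minutes_per_action
    return action_minutes * price


if __name__ == "__main__":
    # 5 actions x 10 minutes = 50 action-minutes
    print(experiment_cost(5, 10))
```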

Cost Optimization Tips

Cost Reduction Strategies:
├── Short experiment duration
│   └── Run only minimum time needed for validation
├── Minimize target resources
│   └── Sample resources, not entire fleet
├── Use dev/test environments
│   └── Validate in non-production before production
└── Leverage Scenario Library
    └── Pre-built templates reduce trial and error

SAA-C03 Exam Focus Points

Commonly Tested Scenarios

  1. Resilience Testing: "Validate system behavior during failures" → AWS FIS
  2. Game Day: "Recreate real failure scenarios" → AWS FIS
  3. AZ Failure Simulation: "Verify behavior during specific AZ failure" → AWS FIS
  4. Automated Testing: "Resilience testing in CI/CD" → AWS FIS
  5. Safety Measures: "Stop conditions, CloudWatch alarm integration" → FIS Stop Conditions

Sample Exam Questions

Exam Tip

Sample Exam Question 1: "How can you verify that a Multi-AZ deployed application actually works correctly during an AZ failure?"

→ Answer: AWS Fault Injection Simulator (simulate AZ failure, verify failover behavior)

Exam Tip

Sample Exam Question 2: "How can you automatically stop an FIS experiment if application error rate spikes?"

→ Answer: Link CloudWatch alarm as FIS experiment Stop Condition

Exam Tip

Sample Exam Question 3: "What measures are needed to safely run chaos engineering experiments in production?"

→ Answer: Configure Stop Conditions + Limit target resource scope + Progressive experiments


Frequently Asked Questions

Q: What's the difference between FIS and third-party chaos engineering tools?

AWS FIS integrates natively with AWS services, which makes IAM permissions, CloudWatch alarms, and target filtering easier to set up. Third-party tools such as Chaos Monkey and Gremlin support multi-cloud environments, whereas FIS offers deeper AWS-specific integration.

Q: Does FIS actually terminate resources?

Yes. FIS acts on real resources: the EC2 stop action really stops the targeted instances, and the terminate action terminates them. Always test in non-production environments first, and run with limited scope in production.

Q: What if something goes wrong during an experiment?

If a Stop Condition triggers, FIS stops the experiment automatically. You can also stop it manually from the console or CLI. Note that stopping an experiment does not bring back instances that were already terminated; Auto Scaling has to replace them.

Q: Which regions support FIS?

FIS is available in most AWS commercial regions. Check AWS documentation for new region availability.

Q: Can FIS corrupt data?

FIS itself doesn't directly corrupt data. However, if an EBS volume is configured to be deleted on instance termination, terminating the instance deletes the volume and its data. Verify data backups and EBS settings before running experiments.

