These templates let you perform fault injection experiments on resources (applications, network, and infrastructure) in the AWS Cloud.
AWS Fault Injection Simulator (AWS FIS) is a managed service that enables you to perform fault injection experiments on your AWS workloads. Fault injection is based on the principles of chaos engineering. These experiments stress an application by creating disruptive events so that you can observe how your application responds. You can then use this information to improve the performance and resiliency of your applications so that they behave as expected.
To use AWS FIS, you set up and run experiments that help you create the real-world conditions needed to uncover application issues that can be difficult to find otherwise. AWS FIS provides templates that generate disruptions, and the controls and guardrails that you need to run experiments in production, such as automatically rolling back or stopping the experiment if specific conditions are met.
This CDK package deploys several stacks: (1) the parent stack FISPa, (2) a stack for the IAM roles FisRole, (3) a stack for the stop-condition StopCond (CloudWatch alarm), (4) a stack for each FIS experiment group (EC2API, AsgExp, EksExp, NaclExp, Ec2InstExp), and (5) a stack dedicated to uploading SSM documents FisSsmDocs.
You can choose which experiment groups to deploy by commenting out the respective stacks here.
1 - The IAM roles required to run the experiments:
- The AWS FIS role with all necessary policies as described here
- SSM Automation document role for faults using SSMA.
2 - A set of AWS FIS experiments to get you started:
- Including:
- Stopping and restarting (after duration) all EC2 instances in a VPC, an AZ, and with particular tags.
- Injecting CPU stress on random EC2 instances in a VPC
- Injecting latency on requests to a particular domain (e.g. www.amazon.com) on all EC2 instances in a VPC, an AZ, and with particular tags.
- Including:
- Injecting EC2 API Internal Error on a target IAM role
- Injecting EC2 API Throttle Error on a target IAM role
- Injecting EC2 API Unavailable Error on a target IAM role
- Including:
- Terminating all EC2 instances of a random AZ in a particular Auto Scaling group.
- Injecting CPU stress on all EC2 instances of a particular Auto Scaling group.
- Including:
- Modifying the network ACLs (NACLs) associated with subnets that belong to a particular AZ to deny traffic in that AZ.
- Including:
- Running the EC2 API action TerminateInstances on the EKS target node group.
- Including:
- Changing a particular security group ingress rule (open SSH to 0.0.0.0/0) to verify remediation automation or monitoring. (Courtesy of Jonathan Rudge). Possible remediation automation (https://github.com/adhorn/ssh-restricted)
- Including:
- Denying access to an S3 resource from any application or service by targeting its IAM role. (Courtesy of Rudolph Wagner)
- Including:
- Support for Lambda Python runtime via the chaos-lambda library.
chaos_lambda is a small library that injects chaos into AWS Lambda. It offers simple Python decorators to inject latency, throw exceptions, and modify the status code of Lambda functions.
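The three fault types mentioned for Lambda (added latency, a thrown exception, a modified status code) can be illustrated with a small handler wrapper. This is a minimal sketch in TypeScript of the general idea only; it is not the chaos_lambda API, which is a Python library based on decorators. The names `ChaosConfig`, `withChaos`, and all field names are hypothetical.

```typescript
// Hypothetical configuration shape; field names are assumptions for this sketch.
interface ChaosConfig {
  enabled: boolean;
  delayMs?: number;      // latency added before the handler runs
  exceptionMsg?: string; // if set, an Error with this message is thrown
  statusCode?: number;   // if set, replaces the statusCode of the response
}

interface HttpResponse {
  statusCode: number;
  body: string;
}

type Handler<E> = (event: E) => Promise<HttpResponse>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Wraps a handler so the configured faults are applied on every invocation.
// The config is fetched per call so it can be changed without redeploying.
function withChaos<E>(getConfig: () => ChaosConfig, handler: Handler<E>): Handler<E> {
  return async (event: E): Promise<HttpResponse> => {
    const cfg = getConfig();
    if (!cfg.enabled) {
      return handler(event); // chaos disabled: behave exactly like the original
    }
    if (cfg.delayMs !== undefined && cfg.delayMs > 0) {
      await sleep(cfg.delayMs); // latency injection
    }
    if (cfg.exceptionMsg !== undefined) {
      throw new Error(cfg.exceptionMsg); // exception injection
    }
    const response = await handler(event);
    return cfg.statusCode !== undefined
      ? { ...response, statusCode: cfg.statusCode } // status code injection
      : response;
  };
}
```

In this sketch the configuration is supplied by a callback; a real setup could load it from a parameter store so that faults can be switched off quickly, which is the role the `enabled` flag plays here.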
These sample FIS experiments use default values for some of the parameters, such as vpc_id, asg_name, eks_cluster_name, etc. Modify these in the file cdk.json before deploying to reflect the particulars of your own AWS environment.
"context": {
"vpc_id": "vpc-01316e63b948d889d",
"asg_name": "Test-FIS-ASG",
"eks_cluster_name": "test-cluster-chaos",
"security_group_id": "sg-022eb488dbd1655b3",
"target_role_name": "target-role",
"s3-bucket-to-deny": "mybucket/*",
"ssm_parameter_name": "chaoslambda.config"
}
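Since every experiment depends on these context values, it can help to fail early when one is missing or empty. The following is a minimal sketch, assuming the context has already been read into a plain object (in a CDK app, individual values are typically read with `node.tryGetContext(key)`). The helper name `loadFisContext` is hypothetical; the key names are the ones listed in the context block.

```typescript
// Keys taken from the "context" block in cdk.json.
const REQUIRED_CONTEXT_KEYS = [
  "vpc_id",
  "asg_name",
  "eks_cluster_name",
  "security_group_id",
  "target_role_name",
  "s3-bucket-to-deny",
  "ssm_parameter_name",
] as const;

type ContextKey = (typeof REQUIRED_CONTEXT_KEYS)[number];
type FisContext = Record<ContextKey, string>;

// Hypothetical helper: returns the validated context, or throws an error
// that lists every key that is missing or not a non-empty string.
function loadFisContext(raw: Record<string, unknown>): FisContext {
  const missing = REQUIRED_CONTEXT_KEYS.filter((key) => {
    const value = raw[key];
    return typeof value !== "string" || value.trim() === "";
  });
  if (missing.length > 0) {
    throw new Error(`Missing or empty context values in cdk.json: ${missing.join(", ")}`);
  }
  const result = {} as FisContext;
  for (const key of REQUIRED_CONTEXT_KEYS) {
    result[key] = raw[key] as string;
  }
  return result;
}
```

Reporting all missing keys at once, rather than stopping at the first, avoids repeated edit-and-retry cycles when adapting the defaults to a new environment.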
You can also specify your own tags for filtering EC2 instances. The currently used ones are defined as:
resourceTags: {
'FIS-Ready': 'true'
}
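To show how such tags combine with the VPC and AZ scoping used by the experiments, here is a minimal sketch that builds the target section of an FIS experiment template as a plain object. The function `buildEc2Target` and its option names are hypothetical and are not taken from this repository; the field names (`resourceType`, `resourceTags`, `filters`, `selectionMode`) and the filter paths follow the FIS experiment template format as I understand it, so check them against the FIS documentation before relying on them.

```typescript
interface TargetFilter {
  path: string;
  values: string[];
}

// Only the fields used in this sketch are modelled.
interface Ec2InstanceTarget {
  resourceType: "aws:ec2:instance";
  resourceTags: Record<string, string>;
  filters: TargetFilter[];
  selectionMode: string;
}

// Hypothetical options object for this sketch.
interface Ec2TargetOptions {
  tags: Record<string, string>;
  vpcId?: string;
  availabilityZone?: string;
  count?: number; // when set, pick this many matching instances at random
}

function buildEc2Target(options: Ec2TargetOptions): Ec2InstanceTarget {
  const filters: TargetFilter[] = [];
  if (options.vpcId !== undefined) {
    filters.push({ path: "VpcId", values: [options.vpcId] });
  }
  if (options.availabilityZone !== undefined) {
    filters.push({ path: "Placement.AvailabilityZone", values: [options.availabilityZone] });
  }
  if (options.count !== undefined && (!Number.isInteger(options.count) || options.count < 1)) {
    throw new Error("count must be a positive integer");
  }
  return {
    resourceType: "aws:ec2:instance",
    resourceTags: { ...options.tags },
    filters,
    // "ALL" targets every match; "COUNT(n)" targets n matches chosen at random.
    selectionMode: options.count === undefined ? "ALL" : `COUNT(${options.count})`,
  };
}
```

For example, `buildEc2Target({ tags: { "FIS-Ready": "true" }, vpcId: "vpc-01316e63b948d889d", count: 1 })` describes one random tagged instance in that VPC, while omitting `count` describes all of them.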
3 - An example stop-condition using CloudWatch alarm
All templates use the same CloudWatch alarm to get you started with the stop-condition. You can use this alarm to get familiar with canceling experiments. For example, you can trigger that alarm, for 1 minute, using the following command:
aws cloudwatch set-alarm-state --alarm-name "NetworkInAbnormal" --state-value "ALARM" --state-reason "testing FIS"
Once you are familiar with the stop-condition, you should of course update the CloudWatch alarms with ones specific to your application and architecture.
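When replacing the sample alarm with your own, the stop condition in an experiment template refers to the alarm by its ARN. The following is a minimal sketch that builds such a stop-condition entry from an alarm name; the function name `alarmStopCondition` is hypothetical, the standard `aws` partition is assumed, and the account ID used in the example is a placeholder.

```typescript
// Shape of one stop-condition entry in an FIS experiment template.
interface StopCondition {
  source: "aws:cloudwatch:alarm" | "none";
  value?: string; // ARN of the CloudWatch alarm when source is an alarm
}

// Hypothetical helper: builds a stop condition for a named CloudWatch alarm.
// Assumes the standard "aws" partition.
function alarmStopCondition(region: string, accountId: string, alarmName: string): StopCondition {
  if (!/^\d{12}$/.test(accountId)) {
    throw new Error("accountId must be a 12-digit AWS account ID");
  }
  if (alarmName.length === 0) {
    throw new Error("alarmName must not be empty");
  }
  return {
    source: "aws:cloudwatch:alarm",
    value: `arn:aws:cloudwatch:${region}:${accountId}:alarm:${alarmName}`,
  };
}

// Example with the sample alarm name and a placeholder account ID.
const sampleStopCondition = alarmStopCondition("us-east-1", "123456789012", "NetworkInAbnormal");
```

An experiment template can list several stop conditions, so alarms for different parts of an architecture can each be added as separate entries.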
4 - A stack dedicated to uploading SSM docs (Automation or Run-Command)
You first need to install the AWS CDK as described here - typically using:
npm install -g aws-cdk
You must then configure your workstation with your credentials and an AWS region, if you have not already done so. If you have the AWS CLI installed, the easiest way to satisfy this requirement is to issue the following command:
aws configure
Finally, you can deploy these FIS experiments using the CDK as follows:
npm install
cdk bootstrap
cdk deploy --all
During the creation of the different stacks, some will generate a security warning as follows:
(NOTE: There may be security-related changes not in this list. See https://github.com/aws/aws-cdk/issues/1299)
Do you wish to deploy these changes (y/n)?
Select y (yes).
- npm run build: compile TypeScript to JS
- npm run watch: watch for changes and compile
- npm run test: perform the jest unit tests
- cdk deploy: deploy this stack to your default AWS account/region
- cdk diff: compare deployed stack with current state
- cdk synth: emits the synthesized CloudFormation template
The cdk.json file tells the CDK Toolkit how to execute your app.