Major disruptions that impact everyday business operations do happen. To convince yourself, have a look at the timeline of AWS's major outages, or think about Facebook's recent outage. If your services are hosted in one of the affected data centers, you may suffer significant financial losses.
Preparing for these scenarios is the best insurance policy to mitigate their impact. On top of that, being prepared brings many benefits:
This is where chaos engineering comes into play.
"Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions."
Principles of chaos engineering (source)
Chaos engineering provides theories, best practices, and tools to get your company ready to face many types of disruption. I like to sort disruptions into four categories and tackle them independently.
When starting with chaos engineering, I recommend beginning with the most critical scenarios (Severity 1), because they lead to the most improvement for your company. For instance, they force you to lay the foundation for incident response and incident management.
Below are the steps you should follow to prepare for and recover from any disaster.
The main requirement for starting with chaos engineering is a good alerting and monitoring system. You cannot fly blind while simulating disasters in your production environment.
Start by defining a 'steady state': some measurable output of the system that indicates normal behavior.
You should start by monitoring health and availability. There are three types of monitoring you will need to have in place:
I suggest that you build your dashboards and metrics around a checklist that represents the boot sequence of your entire system.
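As an illustration, such a checklist can be modeled as an ordered list of health probes that mirrors the boot sequence. This is only a sketch: the component names are hypothetical, and in practice each probe would query your real monitoring system rather than return a hard-coded value.

```python
# Hedged sketch: a boot-sequence checklist where each component has a health
# probe, checked in dependency order. Component names are hypothetical.

def boot_sequence_status(checks):
    """Run health probes in boot order; stop at the first failure."""
    healthy = []
    for name, probe in checks:
        if not probe():
            return healthy, name  # components up so far, first failing one
        healthy.append(name)
    return healthy, None

# Example order: infrastructure first, then data stores, then services.
checks = [
    ("network",       lambda: True),
    ("database",      lambda: True),
    ("message-queue", lambda: True),
    ("api-gateway",   lambda: False),  # simulate a failing component
    ("frontend",      lambda: True),
]

healthy, failing = boot_sequence_status(checks)
print(healthy)  # components that passed, in boot order
print(failing)  # first component that failed
```

Walking the checklist in order means a dashboard can point you directly at the first broken link in the chain instead of a wall of red.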
Once you have established a baseline and worked on your monitoring and observability, it is time to put your system under stress to validate what you have achieved so far. You are going to shut down all your dev/staging/pre-prod environments once a day and recreate them from scratch.
Unless you have a very international team, there is always a window when no one is working while your environments are still up and running. If no one is working, the environments should be shut down, both to save costs and to test your infrastructure scripts.
It may not be wise to start with a big-bang approach (i.e., deleting everything at once). Instead, use a progressive approach. For instance, start by deleting microservices every night and redeploying them in the morning using your CI/CD pipeline. Next, add the deletion of databases, which lets you validate that your backup and restore procedures work properly. Finally, delete infrastructure components every day and recreate them from scratch the next day.
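The progressive approach can be sketched as a schedule whose scope grows one stage at a time. The stage names and resources below are purely illustrative; the actual deletion and redeployment would be delegated to your CI/CD and infrastructure-as-code tooling.

```python
# Hedged sketch of a progressive nightly-teardown schedule.
# Stage names and resource identifiers are hypothetical placeholders.

TEARDOWN_STAGES = [
    ("microservices",  ["orders-svc", "billing-svc"]),  # week 1: redeploy via CI/CD
    ("databases",      ["orders-db"]),                  # week 2: restore from backup
    ("infrastructure", ["vpc", "k8s-cluster"]),         # week 3: recreate from IaC
]

def resources_to_delete(week):
    """The scope of the nightly teardown grows by one stage per week."""
    scope = []
    for stage, resources in TEARDOWN_STAGES[:min(week, len(TEARDOWN_STAGES))]:
        scope.extend(resources)
    return scope

print(resources_to_delete(1))  # only microservices at first
print(resources_to_delete(3))  # eventually everything, including infrastructure
```

Starting small and widening the blast radius week by week keeps each failure mode isolated, so you always know which layer's automation broke.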
Once you can run this daily routine without issue, it means you have a solid set of automation scripts to manage your infrastructure. This exercise is also the perfect occasion to refine your monitoring.
Bringing an environment up and down is the easy part; if you have done your job properly from day one, this should almost be a given. If not, it means you have accumulated technical debt. Luckily, deleting the environment every day reduces the chance of accumulating more: if someone does something manually, their day's work is reduced to ashes the next day. Believe me, everyone soon learns the lesson, and everything-as-code thrives.
The next task in getting ready for a catastrophic event is making your infrastructure more configurable. There is no reason for an automation script to lock your product into one region of the world. Refactor your code and configuration so that you can deploy to any region your cloud provider offers.
Beware: regions have their own specificities (instance types, available services, etc.). Now is the time to learn more about your cloud provider. You should also study the different strategies for migrating to another region:
So now you should be able, once a week for instance, to redeploy your environment in a different region, and no one should see the difference.
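One way to avoid hard-coding a region is to make it a first-class deployment parameter. The sketch below assumes your scripts read configuration from the environment; the variable name, default region, and naming convention are all hypothetical.

```python
# Hedged sketch: parameterizing the target region instead of hard-coding it.
import os

def make_deploy_config(region=None):
    """Build a deployment config whose target region is configurable."""
    region = region or os.environ.get("DEPLOY_REGION", "eu-west-1")
    return {
        "region": region,
        "stack_name": f"prod-{region}",  # hypothetical naming convention
    }

# With a provider SDK such as boto3, the region would then be passed
# explicitly, e.g. boto3.client("ec2", region_name=cfg["region"]).
cfg = make_deploy_config("us-east-1")
print(cfg["stack_name"])
```

Once every script takes the region this way, the weekly "redeploy elsewhere" exercise becomes a one-variable change.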
At this point, you are technically ready to face any disaster. But do you have a clear idea of what to do? Most likely not.
It is important to define an incident response strategy. Document how people should act when a problem occurs that requires you to deploy your entire stack somewhere else.
Last, you need practice. Choose a date, get everyone involved, and act as if the incident had occurred.
This may not be the fun part of chaos engineering, with no cool tool like Chaos Monkey or Gremlin, but I truly believe it is the first element to put on your roadmap. Without a recovery plan, it may not be safe to create chaos in your infrastructure anyway.
In future articles, I plan to cover chaostoolkit to automate chaos scenarios for Severity 1 and 2 disruptions. You should definitely check it out; it is the only tool I know of that is designed for experimenting with these levels of disruption.
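As a preview, here is a minimal sketch of the shape of a chaostoolkit experiment, written as a Python dict mirroring its JSON file format. The endpoint, tolerance, and titles are placeholders, not a working experiment.

```python
# Hedged sketch of a chaostoolkit-style experiment declaration.
# The health-check URL and values below are illustrative only.

experiment = {
    "title": "Service survives the loss of one instance",
    "description": "Severity 2 disruption: terminate a single instance.",
    "steady-state-hypothesis": {
        "title": "Application responds normally",
        "probes": [
            {
                "type": "probe",
                "name": "app-is-healthy",
                "tolerance": 200,  # expect HTTP 200 from the health endpoint
                "provider": {
                    "type": "http",
                    "url": "https://example.com/health",  # hypothetical endpoint
                },
            }
        ],
    },
    "method": [
        # actions that inject the failure would go here,
        # e.g. terminating an instance via a provider-specific driver
    ],
}
```

The key idea is that the steady-state hypothesis is verified before and after the method runs, so the experiment itself tells you whether the system withstood the disruption.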