How to Get Started with Chaos Engineering? — Preparing for Major Disruption

#infrastructure automation   #IncidentManagement   #SoftwareEngineering   #technology   #DevOps  
Major disruptions that impact everyday business operations do happen. To convince yourself, have a look at the timeline of AWS's major outages, or think about Facebook's recent outage. If your services are hosted in one of the affected data centers, you may suffer significant financial losses.

Preparing for those scenarios is the best insurance policy to mitigate the impact of such events. On top of that, being prepared brings many benefits:

  • Better overall architecture design
  • Less firefighting and less employee burnout
  • Better engagement in supporting the production environment

This is where chaos engineering comes into play:

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.

Principles of chaos engineering (source)

Chaos engineering provides theories, best practices, and tools to get your company ready to face many types of disruption. I like to sort disruptions into four categories and tackle them independently.

Severity level Chaos Engineering

When starting with Chaos Engineering, I recommend beginning with the most critical scenarios (Severity 1) because they lead to the most improvement for your company. For instance, they force you to lay the foundation for incident response and incident management.

The following are the steps you should follow to prepare for and recover from any disaster.

Build a Baseline

The main requirement to start with chaos engineering is a good alerting and monitoring system. Indeed, you cannot fly blind while simulating disasters in your production environment.

Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.

You should start by monitoring health and availability. There are three types of monitoring you will need to have in place:

  • Network Monitoring: Validate that the network components are functioning properly. This includes firewalls, routers, switches, servers, virtual machines, etc. Network issues can be very damaging because they often have a large blast radius.
  • Infrastructure Monitoring: Validate that the IT infrastructure is up and running. This includes external providers, services, servers, storage, platforms, shared services, etc.
  • Application Monitoring: Validate that the software is behaving properly. This includes logs, error codes, performance, and user behavior.

I suggest that you construct your dashboards and metrics around a checklist that represents the boot sequence of your entire system. For example:

  • Network configuration ready
  • Number of instances up vs. number of instances required
  • Number of backing services up vs. number required
  • Number of microservices deployed vs. number required
  • Health status of all services
  • Traffic in users/min vs. expected traffic
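As an illustration, such a checklist can be encoded directly as a health check. A minimal sketch, where the component names and expected counts are made up; in practice, they would come from your monitoring or service-discovery APIs:

```python
# Minimal sketch of a boot-sequence checklist for a steady-state dashboard.
# The component names and expected counts below are hypothetical placeholders.

EXPECTED = {
    "instances": 10,        # number of instances required
    "backing_services": 3,  # databases, queues, caches, ...
    "microservices": 12,    # services that must be deployed
}

def steady_state(observed):
    """Map each component to True when the observed count meets the baseline."""
    return {name: observed.get(name, 0) >= want for name, want in EXPECTED.items()}

def is_healthy(observed):
    """The system is in steady state only if every check passes."""
    return all(steady_state(observed).values())
```

A failed check pinpoints which step of the boot sequence is broken, which is exactly what you need when rebuilding an environment after a disaster.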

Destroy your "dev" Environment Every Day

Once you have established a baseline and worked on your monitoring and observability, it is time to put your system under stress to validate what you have achieved so far. You are going to shut down all your dev/staging/pre-prod environments once a day and recreate them from scratch.

Unless you have a very international team, there is always a window when no one is working while your environments are still up and running. If no one is working, the environments should be shut down to save costs and test your infrastructure scripts.

It may not be wise to start with a big bang approach (i.e., deleting everything at once). Instead, use a progressive approach. For instance, start by deleting microservices every night and redeploying them in the morning using your CI/CD pipeline. Next, add the deletion of databases to validate that your backup and restore procedures work properly. Finally, delete infrastructure components every day and recreate them from scratch the next day.
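The progressive approach can be sketched as a simple schedule. The phase contents below are illustrative and not tied to any particular tooling:

```python
# Sketch of the progressive teardown approach: each phase widens the
# nightly delete-and-recreate cycle. Phase contents are illustrative.

PHASES = {
    1: ["microservices"],                                 # redeploy via CI/CD
    2: ["microservices", "databases"],                    # exercise backup/restore
    3: ["microservices", "databases", "infrastructure"],  # full rebuild
}

def nightly_teardown(phase):
    """Return the components to delete tonight for a given phase."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    return PHASES[phase]
```

Advancing a phase only once the previous one runs without incident keeps the blast radius of each experiment small.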

Once you can go through this daily routine without issues, it means you have a solid set of automation scripts to manage your infrastructure. This exercise is also the perfect occasion to refine your monitoring.

Bringing an environment up and down is the easy part; if you did your job properly from day one, this should almost be a given. If not, it means you have accumulated technical debt. Luckily, deleting the environment every day reduces the chance of accumulating more debt: if someone makes a manual change, their day of work is reduced to ashes the next day. Believe me, everyone will soon learn the lesson, and everything-as-code will thrive.

Migrate your "dev" Environment to Another Data Center

The next task in preparing for a catastrophic event is to make your infrastructure a bit more configurable. There is no reason for an automation script to lock your product into one region of the world. You should refactor your code and configuration so that you can deploy to any region your cloud provider offers.

Beware: regions have their own specificities (instance types, available services, etc.). Now is the time to learn more about your cloud provider. You should also study the different strategies for migrating to another region:

  • Active-Active: You have at least two sites up and running and requests are load-balanced between them
  • Active-Passive: You have one region with all your services and a second region with the infrastructure in place but receiving no traffic (typically, the cluster waits to scale up when load is redirected to it)
  • Active-Recovery: You have one region, and you are ready to deploy your infrastructure in a new region if needed.
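To avoid locking deployments into one region, the region can be treated as plain configuration. A minimal sketch, assuming AWS-style region and instance names and a hypothetical per-region override table:

```python
# Sketch of a region-agnostic deployment configuration. Region and instance
# names are illustrative AWS conventions; the override table accounts for
# the per-region specificities mentioned above.

from dataclasses import dataclass, field

@dataclass
class DeployConfig:
    region: str = "us-east-1"
    instance_type: str = "m5.large"
    # Some instance types or services are unavailable in certain regions,
    # so allow per-region overrides instead of hard-coding one layout.
    overrides: dict = field(default_factory=lambda: {
        "eu-west-3": {"instance_type": "m5.xlarge"},
    })

    def resolved(self):
        """Final settings for the chosen region, with overrides applied."""
        settings = {"region": self.region, "instance_type": self.instance_type}
        settings.update(self.overrides.get(self.region, {}))
        return settings
```

With this shape, redeploying somewhere else is a one-parameter change rather than a refactoring project.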

Now you should be able, once a week for instance, to redeploy your environment in a different region without anyone noticing the difference.

Build your Incident Response Strategy

At this point, you are technically ready to face any disaster. But do you have a clear idea of what to do? Most likely not.

It is important to define an incident response strategy. Document how people should act when a problem occurs that requires you to redeploy your entire stack somewhere else. At a minimum, answer these questions:

  • How do we detect such problems?
  • Who is in charge of coordinating the response to the incident?
  • Who should be contacted to handle the incident?
  • How do you communicate the incident internally and externally?
  • How long should it take at most to be back online?
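Answering these questions up front can be captured in a small, versioned runbook. A sketch in which every name, channel, and time target is a placeholder:

```python
# Sketch of an incident response plan encoded as data so it can be
# versioned and validated. All values here are placeholder assumptions.

RESPONSE_PLAN = {
    "detection": "alerting rules on the steady-state dashboard",
    "incident_commander": "on-call engineer (weekly rotation)",
    "contacts": ["on-call engineer", "platform team lead", "communications lead"],
    "communication": {
        "internal": "#incident chat channel",
        "external": "public status page",
    },
    "recovery_time_objective_minutes": 60,
}

def validate(plan):
    """Fail fast if the runbook leaves one of the key questions unanswered."""
    required = {"detection", "incident_commander", "contacts",
                "communication", "recovery_time_objective_minutes"}
    missing = required - plan.keys()
    if missing:
        raise ValueError(f"incomplete response plan, missing: {sorted(missing)}")
    return True
```

Keeping the plan in version control means it evolves with the system instead of rotting in a wiki.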

Organize a Gameday

Last, you need practice. Choose a date, get everyone involved, and act as if an incident had occurred:

  1. Follow the response procedure (validate that everything makes sense)
  2. Migrate your infrastructure to a new region
  3. Wait a couple of days
  4. Migrate again to the original region.

Final thoughts

This may not be the fun part of chaos engineering (there is no cool tool like Chaos Monkey or Gremlin involved), but I truly believe it is the first element to put on your roadmap. Without a recovery plan, it may not be so safe to create chaos in your infrastructure anyway.

In future articles, I plan to cover chaostoolkit to automate chaos scenarios for Severity 1 and 2 disruptions. You should definitely check it out; it is the only tool I know of designed for experimenting with these levels of disruption.

Alexandre Couëdelo

@acouedelo
Complex Systems Engineering and Management Specialist. I have embraced the DevOps culture since I started my career.