Chaos Engineering and Mesos


Modern applications are distributed by default. They are composed of multiple services that have to communicate with each other (a back-end and a database, for example). Distributed computing is hard, and when these applications are deployed in the cloud it’s even harder. In fact, the number of variables that can produce an outage of these applications grows very fast. These applications usually depend on public cloud services like Amazon or Google, and those services can suffer outages, instances can die, and other infrastructure-related issues can appear. When working in the cloud, resilience (fault tolerance) must be in the DNA of an application. This means that when some failure happens (and keep in mind that it will happen), the application should remain available with very little impact on its customers. But how can you be sure that your application is resilient? How do you know that your Amazon S3-backed service will survive a total outage of an entire Amazon S3 region? (Which seems unlikely, but it happened in March 2017.)

With these considerations in mind, Netflix engineers confronted these challenges while moving their platform to AWS and gave us the solution: the principles of Chaos Engineering.

The manifesto defines Chaos Engineering as:

The discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Essentially, Chaos Engineering is about gaining confidence in an application’s resiliency by injecting “once in a blue moon” failures (as Bruce Wong of Netflix describes them) and measuring the impact these failures have on the application’s performance.

This controlled failure injection helps discover the application’s weaknesses in a controlled environment before they surface on their own, so it’s possible to learn about them, fix them, and thereby increase the application’s availability. Even if you aren’t able to fix a problem, your team will gain knowledge about how to confront it, and if it shows up some day at 3:00 AM, the team will know how to solve it in a pretty straightforward way.

For Netflix, every failure and system outage is an opportunity to learn about the application’s behavior, how to fix it and how to improve its design. For example, during the Amazon ELB total region outage in December 2012, they discovered that they needed a multi-region active-active replica to survive such outages, something that hadn’t been considered when Netflix’s architecture was designed.

It’s recommended to run these experiments in a production environment, because if you want to see the real impact you need real traffic. The bad news about doing a chaos experiment in production is that you can cause an entire application outage. For this reason, Netflix launches its experiments during office hours, so that if an experiment causes an outage the engineering team can solve it as soon as possible.

The phases of a Chaos experiment, as defined in the manifesto, are the following:

  1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will continue in both the control group and the experimental group.
  3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
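These phases can be sketched in code. Below is a minimal, hypothetical steady-state check: the metric (request success rate) and the 1% tolerance are illustrative assumptions, not something prescribed by the manifesto.

```python
# Hypothetical steady-state check: the metric (request success rate) and the
# 1% tolerance are illustrative assumptions, not part of the manifesto.

def success_rate(successes, total):
    """Steady-state metric: fraction of requests served successfully."""
    return successes / total if total else 0.0

def hypothesis_holds(control, experiment, tolerance=0.01):
    """The hypothesis survives if the experimental group's success rate stays
    within `tolerance` of the control group's (phases 2 and 4)."""
    return abs(success_rate(*control) - success_rate(*experiment)) <= tolerance

# Control group: 9990/10000 requests OK; experimental group (with failures
# injected): 9950/10000. The 0.4% drop is within tolerance, so the
# hypothesis holds.
print(hypothesis_holds((9990, 10000), (9950, 10000)))  # prints True
```

In a real experiment the two groups would receive comparable slices of production traffic, and the metric would come from your monitoring system rather than hard-coded counts.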

It can be briefly summed up as: break things, see what happens and act according to the outcome. There are a lot of experiments you can run: stop server instances, simulate network partitions, add network latency, revoke connectivity to a database… Be creative (who would have imagined that a Dyn DNS outage could happen, making half of the internet’s services unavailable?). Keep in mind that some experiments should be simulated (this is why it’s called controlled failure injection). For example, if you want to see what happens when a service can’t access a database table, don’t drop the table in production; simply revoke the service’s permissions to access the table.

If a chaos experiment is successful, you shouldn’t stop running it. As the application evolves and new features are added, it’s important to repeat the test often to catch regressions and to make sure that the new features follow the same resiliency principles as the rest of the application. That is where having these experiments automated can do the trick (see the Simian Army).

A quick chaos experiment example

As a quick example of how chaos engineering works: in a previous post we talked about the split-brain problem in Akka Cluster and how to avoid it, but we haven’t tested the solution. To do so, we can run a chaos experiment that causes a network partition between the instances of the cluster and see whether we end up with two clusters or everything works as expected. To simulate the network partition, we can, for instance, add firewall rules to the affected instances that block the connections between them.
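Assuming the instances run Linux, such a partition could be simulated with iptables rules like the ones built below. This is a sketch: the peer address is hypothetical, and the commands are only constructed, not executed (they would have to be run as root on the affected instance to take effect).

```python
# Sketch: build the iptables commands that would drop all traffic between
# this node and a peer, simulating a network partition. The peer address
# below is hypothetical; the commands are only constructed here, and would
# need to be executed as root on the instance to actually cut the link.

def partition_rules(peer_ip, action="-A"):
    """Return iptables commands cutting traffic to/from `peer_ip`.
    action="-A" appends the rules (start the partition); action="-D"
    deletes them again (heal the partition)."""
    return [
        f"iptables {action} INPUT -s {peer_ip} -j DROP",
        f"iptables {action} OUTPUT -d {peer_ip} -j DROP",
    ]

for command in partition_rules("10.0.0.2"):
    print(command)
```

Running the `-D` variants afterwards restores connectivity, which is exactly the “controlled” part of controlled failure injection: the experiment must be reversible.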

Applying chaos engineering in Mesos

Apache Mesos is a resource manager that helps schedule applications and services across the whole datacenter from a single point, using APIs (much as an OS kernel does with the processor cores of a single machine). When you use Mesos as a service orchestrator, you usually use Marathon to abstract away some of the Mesos complexity and manage these services. Marathon is a Mesos scheduler that manages services and applications, allowing you to scale them and adding self-healing to the services (if an instance of a service is killed, it launches a new one). It provides its own APIs to manage the deployed services. An instance of a service corresponds to a task running in Mesos, so the way to inject chaos into an application is to kill some of its services’ tasks and check the application’s availability.

With this in mind, driving a chaos experiment in Marathon is pretty easy (destroying is always easier than building). Taking a look at the Marathon REST API, we can see that there is an endpoint from which we can list all the services and their associated tasks:

GET /v2/apps?embed=apps.tasks         // for all the services running in Marathon
GET /v2/apps/{appId}?embed=app.tasks  // for a concrete service

From this query, you can get the IDs of the running tasks of the services deployed in Marathon. Then simply select the ones you want to kill (randomly, or with another algorithm of your choice, such as using Mesos attributes to find which tasks are running in a particular rack and killing them all, simulating a total rack outage) and call the following endpoint to kill the tasks:

POST /v2/tasks/delete
{
  "ids": [
    "task.id.1",
    "task.id.2"
  ]
}
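Putting both endpoints together, a minimal experiment driver could look like the following sketch. It uses only Python’s standard library; the Marathon address, the app ID and the kill fraction are assumptions, and the HTTP-calling functions are defined but not invoked here, so only the random selection logic actually runs.

```python
import json
import random
import urllib.request

MARATHON = "http://localhost:8080"  # assumed Marathon address

def list_task_ids(app_id):
    """Fetch the running task IDs of one service
    (GET /v2/apps/{appId}?embed=app.tasks)."""
    url = f"{MARATHON}/v2/apps/{app_id}?embed=app.tasks"
    with urllib.request.urlopen(url) as resp:
        app = json.load(resp)["app"]
    return [task["id"] for task in app["tasks"]]

def pick_victims(task_ids, fraction=0.5, seed=None):
    """Randomly select a fraction of the tasks to kill (at least one)."""
    if not task_ids:
        return []
    count = max(1, int(len(task_ids) * fraction))
    return random.Random(seed).sample(task_ids, count)

def kill_tasks(task_ids):
    """Kill the selected tasks (POST /v2/tasks/delete with a JSON body)."""
    req = urllib.request.Request(
        f"{MARATHON}/v2/tasks/delete",
        data=json.dumps({"ids": task_ids}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

# The selection step can be exercised without a live cluster
# (hypothetical task IDs):
sample_tasks = ["my-app.instance-1", "my-app.instance-2",
                "my-app.instance-3", "my-app.instance-4"]
print(pick_victims(sample_tasks, fraction=0.5, seed=42))
```

Against a live cluster, you would call `kill_tasks(pick_victims(list_task_ids("my-app")))` and then watch your steady-state metric to see whether Marathon’s self-healing keeps the application available.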

Chaos Engineering is a useful way to test and improve an application’s availability and resiliency, but it has to be done carefully, because the experiments run in production and can cause a system outage. However, the benefits of running these kinds of experiments are worth the risk.

Maybe if Amazon had run these experiments, they would have discovered that the red status icon in the dashboard for the S3 US-East-1 region was stored in the very region it was supposed to report on…

Roberto Veral

I started getting in touch with Big Data, using Spark and the Hadoop ecosystem, working with cloud providers. Now, as a Big Data Developer, I work with IaaS, Terraform, Ansible, Scala, Spark... I am a vocational Computer Scientist and I enjoy learning new technologies and discovering new things.
