Chaos engineering

The logo for Chaos Monkey used by Netflix

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.^[1]

Concept

In software development, a given software system's ability to tolerate failures while still ensuring adequate quality of service—often generalized as resiliency—is typically specified as a requirement. However, development teams often fail to meet this requirement due to factors such as short deadlines or lack of knowledge of the field. Chaos engineering is a technique to meet the resilience requirement.

Chaos engineering can be used to achieve resilience against:

Infrastructure failures
Network failures
Application failures

History

While overseeing Netflix's migration to the cloud in 2011,^[2]^[3] Greg Orzell had the idea to address the lack of adequate resilience testing by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered to be inevitable, driving developers to consider built-in resilience to be an obligation rather than an option:

"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."^[4]

By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.

The concept of chaos engineering is close to that of Phoenix Servers, first introduced by Martin Fowler in 2012.^[5]

Perturbation models

A chaos engineering tool implements a perturbation model. The perturbations, also called turbulences, are meant to mimic rare or catastrophic events that can happen in production. To maximize the added value of chaos engineering, the perturbations are expected to be realistic.^[6]

Server shutdowns

One perturbation model consists of randomly shutting down servers. Netflix's Chaos Monkey is an implementation of this perturbation model.
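As a rough illustration of this perturbation model, the sketch below stops one randomly chosen instance in a pool and then checks that the service still has a running instance. The `Instance` class, the `kill_random_instance` and `service_available` functions, and the in-memory pool are hypothetical stand-ins introduced for the example. It is not Chaos Monkey's code and it calls no cloud API.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one server in a redundant service group."""
    name: str
    running: bool = True


def kill_random_instance(pool, rng):
    """Stop one randomly chosen running instance; return it, or None if none run."""
    candidates = [inst for inst in pool if inst.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False  # a real tool would call the infrastructure API here
    return victim


def service_available(pool):
    """The service counts as available while at least one instance is running."""
    return any(inst.running for inst in pool)


pool = [Instance("web-1"), Instance("web-2"), Instance("web-3")]
rng = random.Random(42)  # seeded so the experiment can be repeated
victim = kill_random_instance(pool, rng)
print(f"stopped {victim.name}; service available: {service_available(pool)}")
```

With three redundant instances, the service stays available after one shutdown. The same check fails on a pool of one, which is the kind of weakness this model is meant to expose.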

Latency injection

Another perturbation model introduces communication delays to simulate degradation or outages in a network. For example, Chaos Mesh supports the injection of latency.
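A minimal application-level sketch of latency injection follows, assuming the delay can be added around a function call rather than in the network itself (tools such as Chaos Mesh act at the network level, which this example does not reproduce). The `inject_latency` decorator and the `fetch_profile` function are hypothetical names used only for illustration.

```python
import functools
import random
import time


def inject_latency(min_delay, max_delay, probability=1.0, rng=None, sleep=time.sleep):
    """Decorator that delays a call by a random duration with the given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(rng.uniform(min_delay, max_delay))  # simulated network delay
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(min_delay=0.05, max_delay=0.1, rng=random.Random(0))
def fetch_profile(user_id):
    """Hypothetical remote call; here it only returns a canned response."""
    return {"id": user_id, "name": "example"}


start = time.monotonic()
result = fetch_profile(7)
elapsed = time.monotonic() - start
print(f"result={result}, took {elapsed:.3f}s")
```

Passing `sleep` as a parameter lets a test replace the real delay with a recorder, so the experiment logic can be checked without waiting.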

Resource exhaustion

A third perturbation model consumes a given resource until it runs short. For instance, Gremlin can fill up the disk.
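The idea can be sketched on a small, bounded scale: the example below writes a filler file of a fixed size into a temporary directory and removes it when the experiment ends. The `fill_disk` function and the file name are hypothetical. This is an assumption-laden illustration of disk consumption in general, not how Gremlin implements its disk attack.

```python
import os
import tempfile


def fill_disk(directory, total_bytes, chunk_size=1024 * 1024):
    """Write a filler file of total_bytes into directory and return its path."""
    path = os.path.join(directory, "chaos_filler.bin")
    chunk = b"x" * chunk_size
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            written += n
    return path


# Bounded experiment: consume 5 MiB inside a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    filler = fill_disk(workdir, total_bytes=5 * 1024 * 1024)
    filler_size = os.path.getsize(filler)
    print(f"wrote {filler_size} bytes to {filler}")
# Leaving the with-block deletes the directory, which ends the experiment.
```

Keeping the size bounded and the cleanup automatic limits the blast radius, so the sketch cannot exhaust a real disk by accident.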

Chaos engineering tools

Chaos Monkey

Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure.^[2] It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.

The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license.^[7]^[8]

The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:^[9]

"Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy."

Simian Army

The Simian Army^[8]^[10] is a suite of tools developed by Netflix to test the reliability, security, or resiliency of its Amazon Web Services infrastructure and includes the following tools:^[11]

At the very top of the Simian Army hierarchy, Chaos Kong drops a full AWS "Region".^[12] Though rare, loss of an entire region does happen, and Chaos Kong simulates a system's response to, and recovery from, this type of event.

Chaos Gorilla drops a full Amazon "Availability Zone" (one or more entire data centers serving a geographical region).^[13]

LitmusChaos

LitmusChaos is an open-source chaos engineering toolset that helps teams identify weaknesses and potential outages in their infrastructure by running chaos tests in a controlled way. It is aimed at developers and SREs, is based on modern chaos engineering practices, and is developed in collaboration with its community. Litmus is open source and hosted by the CNCF.

LitmusChaos is licensed under Apache 2. Version 2.0.0 adds capabilities to the platform that are intended to make the practice of chaos engineering more efficient. The major version number reflects significant improvements and new features, many of which were introduced and refined across several 2.0 beta releases with community feedback. Some of these changes, especially newer experiments and observability improvements, were also made available in 1.x.

Litmus 1.x brought a cloud-native approach to the definition and execution of chaos intent, along with a ready set of experiments maintained in the ChaosHub. Over time, newer requirements were incorporated into the project, most notably a centralized approach to managing chaos across environments (K8s clusters and cloud instances) and the ability to define workflows that stitch together multiple experiments as part of a complex scenario.

The 2.0 GA release brings these features into the mainstream, after they had been validated for their usefulness and architecture.

Litmus aims to make it straightforward to get started with chaos engineering:

Open source: it offers flexibility and agility along with community support.
Declarative: Litmus provides chaos CRDs to manage chaos. Through the chaos API, orchestration, scheduling, and complex workflow management can be done declaratively.
Ready-made experiments: most generic chaos experiments are readily available for initial chaos engineering needs.
Extensible: users can quickly create a basic experiment structure with the Litmus SDK. Kubernetes developers and SREs then only need to add the chaos logic to create a new experiment.^[14]

Byte-Monkey

Byte-Monkey is a small Java library for testing failure scenarios in JVM applications. It works by instrumenting application code on the fly to deliberately introduce faults such as exceptions and latency.^[15]

ChaosMachine

ChaosMachine^[16] is a tool that does chaos engineering at the application level in the JVM. It concentrates on analyzing the error-handling capability of each try-catch block involved in the application by injecting exceptions.
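The general idea of falsifying an error handler by injecting exceptions can be sketched in a few lines. The example below is a Python analogy under stated assumptions, not ChaosMachine's implementation (which instruments JVM code): `load_settings`, `raising`, and `handler_is_resilient` are hypothetical names, and the fault is injected by swapping in a callable that always raises.

```python
def load_settings(read_source):
    """Code under test: its try/except block should fall back to defaults."""
    try:
        return read_source()
    except OSError:
        return {"timeout": 30}  # fallback used when the source cannot be read


def raising(exc):
    """Build a replacement callable that always raises exc (the injected fault)."""
    def faulty(*args, **kwargs):
        raise exc
    return faulty


def handler_is_resilient(func_under_test, exc):
    """Inject exc into the guarded call and report whether the handler absorbed it."""
    try:
        func_under_test(raising(exc))
    except Exception:
        return False  # the exception escaped the try/except block
    return True


handled = handler_is_resilient(load_settings, OSError("disk unavailable"))
unhandled = handler_is_resilient(load_settings, ValueError("bad encoding"))
print(f"OSError handled: {handled}; ValueError handled: {unhandled}")
```

Here the handler absorbs the injected `OSError` but lets the `ValueError` escape, which shows how an injection campaign can tell resilient handlers apart from ones that cover only some failure types.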

Proofdock Chaos Engineering Platform

Proofdock is a chaos engineering platform that focuses on the Microsoft Azure platform and the Azure DevOps services. Users can inject failures at the infrastructure, platform, and application levels.^[17]

Gremlin

Gremlin is a "failure-as-a-service" platform. It offers engineers a fully hosted solution to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.^[18]

Facebook Storm

To prepare for the loss of a data center, Facebook regularly tests the resilience of its infrastructure to extreme events. Known as the Storm Project, the program simulates massive data center failures.^[19]

Days of Chaos

Inspired by AWS GameDays^[20] to test the resilience of its applications, teams from Voyages-sncf.com participated in a Day of Chaos. Every 30 minutes, operators simulated failures in pre-production. Teams earned points based on detections, diagnoses, and resolutions. This type of gamified event helps to introduce development teams to the concept of resilience.^[21]

The concept was presented at the 2017 DevOps REX conference^[22] and is described on the site http://days-of-chaos.com, which also aims to collect accounts of other such experiments.

ChaoSlingr

ChaoSlingr is the first open-source application of chaos engineering to cyber security. It focuses primarily on performing security experimentation on AWS infrastructure to proactively discover security weaknesses in complex distributed system environments. It was published on GitHub in September 2017.

Chaos Toolkit

The Chaos Toolkit was born from the desire to simplify access to the discipline of chaos engineering and to demonstrate that the experimentation approach can be applied at different levels: infrastructure, platform, and application. The Chaos Toolkit is an open-source tool, licensed under Apache 2, published in October 2017.^[23]

Mangle

Mangle runs chaos engineering experiments against applications and infrastructure components to assess resiliency and fault tolerance. It is designed to introduce faults with very little pre-configuration and supports infrastructure including K8S, Docker, vCenter, and any remote machine with SSH enabled. Its plugin model lets users define a custom fault from a template and run it without building the code from scratch.

Chaos Mesh

Chaos Mesh is an open-source cloud-native Chaos Engineering platform that orchestrates chaos experiments in Kubernetes environments. It supports comprehensive types of failure simulation, including Pod failures, container failures, network failures, file system failures, system time failures, and kernel failures.

Chaos Mesh was published in December 2019 under the Apache 2 license, and became a Cloud Native Computing Foundation (CNCF) sandbox project in July 2020.^[24]

NetHavoc

NetHavoc is a chaos engineering tool offered by Cavisson. It is designed to inject failures at the resource, infrastructure, network, and application levels on platforms such as Linux, K8s, Windows, PCF, cloud, and containers, with the goal of making applications more resilient.

Notes and references

[1] "Principles of Chaos Engineering". principlesofchaos.org. Retrieved 2017-10-21.
[2] "The Netflix Simian Army". Netflix Tech Blog. Medium. 2011-07-19. Retrieved 2017-10-21.
[3] US20120072571 A1, Orzell, Gregory S. & Izrailevsky, Yury, "Validating the resiliency of networked applications"
[4] "Netflix Chaos Monkey Upgraded". Netflix Tech Blog. Medium. 2016-10-19. Retrieved 2017-10-21.
[5] "PhoenixServer". martinFowler.com. Martin Fowler (software engineer). 2012-07-10. Retrieved 2021-01-14.
[6] Zhang, Long; Morin, Brice; Baudry, Benoit; Monperrus, Martin (2021). "Maximizing Error Injection Realism for Chaos Engineering with System Calls". IEEE Transactions on Dependable and Secure Computing: 1–1. arXiv:2006.04444. doi:10.1109/TDSC.2021.3069715. ISSN 1545-5971.
[7] "Netflix libère Chaos Monkey dans la jungle Open Source - Le Monde Informatique". LeMondeInformatique (in French). Retrieved 2017-11-07.
[8] "SimianArmy: Tools for your cloud operating in top form. Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures". Netflix, Inc. 2017-10-20. Retrieved 2017-10-21.
[9] "Mais qui sont ces singes du chaos ?" [But who are these monkeys of chaos?]. 15marches (in French). 2017-07-25. Retrieved 2017-10-21.
[10] "SimianArmy: Tools for keeping your cloud operating in top form. Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures". Netflix, Inc. 2017-11-07. Retrieved 2017-11-07.
[11] SemiColonWeb (2015-12-08). "Infrastructure : quelles méthodes pour s'adapter aux nouvelles architectures Cloud ? - D2SI Blog". D2SI Blog (in French). Archived from the original on 2017-10-21. Retrieved 2017-11-07.
[12] "Chaos Engineering Upgraded". medium.com. 2017-04-19. Retrieved 2020-04-10.
[13] "The Netflix Simian Army". medium.com. Retrieved 2017-12-12.
[14] "Cloud-Native Chaos Engineering – Enhancing Kubernetes Application Resiliency". CNCF. 2019-11-06.
[15] "GitHub repo of Byte-Monkey". GitHub. 2019-06-20.
[16] Zhang, Long; Morin, Brice; Haller, Philipp; Baudry, Benoit; Monperrus, Martin (2019). "A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM". IEEE Transactions on Software Engineering: 1. arXiv:1805.05246. doi:10.1109/TSE.2019.2954871. ISSN 0098-5589. S2CID 46892241.
[17] "A chaos engineering platform for Microsoft Azure". medium.com. Retrieved 2020-06-28.
[18] "Gremlin raises $18 million to expand 'failure-as-a-service' testing platform". VentureBeat. 2018-09-28. Retrieved 2018-10-24.
[19] Hof, Robert (2016-09-11). "Interview: How Facebook's Storm Heads Off Project Data Center Disasters". Forbes. Retrieved 2017-10-21.
[20] SemiColonWeb (2016-07-04). "GameDay AWS: test the resilience of your applications Cloud". D2SI Blog. Archived from the original on 2017-10-21. Retrieved 2017-10-21.
[21] "DevOps: feedback from Voyages-sncf.com - Blog du Moderator". Moderator's Blog (in French). 2017-03-17. Retrieved 2017-10-21.
[22] "Days of Chaos: the development of the devops culture at Voyages-Sn ..." Slideshare. 2017-10-03. devops REX.
[23] Miles, Russ (2017-10-06). "Introducing and Extending the Chaos Toolkit". Russ Miles (the Geek on a Harley). Retrieved 2017-10-23.
[24] Chaos Mesh Authors (2020-07-28). "Chaos Mesh® Joins CNCF as a Sandbox Project". Chaos Mesh®.

External links

Principle of Chaos Engineering – The Chaos Engineering manifesto
Chaos Engineering – Adrian Hornsby
How Chaos Engineering Practices Will Help You Design Better Software – Mariano Calandra
