The CNCF Technical Oversight Committee (TOC) has voted to accept Chaos Mesh as a CNCF incubating project.
Initially created as a testing platform for the open source distributed database TiDB, Chaos Mesh is a versatile chaos engineering platform that orchestrates chaos experiments in Kubernetes environments. It helps ensure that Kubernetes infrastructure can withstand unexpected disruptions by identifying potential points of failure.
Since being accepted into the CNCF Sandbox in July 2020, Chaos Mesh has shipped two major releases (v1.0 and v2.0) and 30 minor releases, bringing significant improvements in observability, functionality, and security. Highlights include:
- Chaos Dashboard, a visual aid to help users manage and monitor chaos experiments through a Web UI.
- A native Workflow engine, used to define chaos scenarios to manage groups of chaos experiments and status checks of applications.
- More powerful and comprehensive chaos simulations covering StressChaos, DNSChaos, JVMChaos, AWSChaos, GCPChaos, HTTPChaos, and more.
- An authorization mechanism based on Kubernetes RBAC permission policies.
More than 50 organizations have adopted Chaos Mesh to test and improve the resiliency of their distributed systems. Adopters include ByteDance, DataStax, Percona, Prudential, NetEase Fuxi, RabbitMQ, SHAREit, XPeng Motors, and many more. In addition, cloud services such as Microsoft Azure Chaos Studio have integrated Chaos Mesh into their offerings so that users can inject faults into AKS clusters. Many of these organizations are also contributing back to the project.
“The fact that Chaos Mesh is built on Kubernetes CRDs gives the project a head start already,” said Hui Zhang, senior Quality Assurance engineer at NetEase Fuxi. “It provides fine-grained chaos support, a unified UI – Chaos Dashboard, along with enhanced observability and precise chaos scope control. All of this is driven by an open, collaborative and vibrant community.”
“Chaos Mesh provides rich fault simulation methods to help infrastructure teams verify scenarios such as high availability, network traffic loss, and bi-directional synchronization in advance, which helps dig out solution flaws and reduce risks,” said Hengliang Tan, lead engineer at XPeng Motors. “It also helps our teams reduce test costs.”
“We built Chaos Mesh with a simple mission – to make Chaos Engineering easier so that complicated systems can become resilient as they should be,” said Cwen Yin, maintainer and co-creator of Chaos Mesh. “The power of the community and ecosystem is essential to achieve this goal. We are excited to see Chaos Mesh become an incubating project. CNCF is the driving force of the cloud native ecosystem and with the support and guidance of the community will help us evolve Chaos Engineering further.”
Chaos Mesh follows a Kubernetes-native design: it uses Kubernetes CustomResourceDefinitions (CRDs) to define chaos objects, so experiments are declared and managed like any other Kubernetes resource. It can also be tightly integrated with other cloud native projects such as Argo, Grafana, and Prometheus, making the chaos experience more manageable, customizable, and observable.
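To illustrate what defining a chaos object as a custom resource can look like, here is a minimal sketch in Python that assembles a PodChaos manifest and prints it as JSON. The helper name `build_pod_chaos`, the experiment name, the namespaces, and the `app: demo` label are hypothetical values chosen for illustration, and the field layout reflects the `chaos-mesh.org/v1alpha1` API as we understand it; check it against the Chaos Mesh documentation for the version you run.

```python
import json


def build_pod_chaos(name, namespace, target_namespace, target_labels):
    """Return a PodChaos manifest as a plain dict (illustrative helper)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Kill one randomly chosen pod among those matched by the selector.
            "action": "pod-kill",
            "mode": "one",
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": dict(target_labels),
            },
        },
    }


# All names below are placeholders, not values from the announcement.
manifest = build_pod_chaos(
    name="pod-kill-example",
    namespace="chaos-testing",
    target_namespace="default",
    target_labels={"app": "demo"},
)

print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output could be saved to a file and applied with `kubectl apply -f` on a cluster where Chaos Mesh is installed.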
Main components:
- Chaos Dashboard: The visualization component of Chaos Mesh. Chaos Dashboard offers a set of user-friendly web interfaces through which users can manipulate and observe Chaos experiments.
- Chaos Controller Manager: The core logical component of Chaos Mesh. Chaos Controller Manager is primarily responsible for scheduling and managing Chaos experiments.
- Chaos Daemon: The main executive component. Chaos Daemon runs as a DaemonSet and has privileged permissions by default (which can be disabled).
- Chaosd: A toolkit to inject failures into non-Kubernetes nodes.
Notable milestones:
- 4.5K GitHub stars
- 1.3K commits
- 800+ closed issues
- 125+ contributors from 60+ organizations
- 32 releases
- 50+ adopters
“No cloud native deployment is perfect – failures will always happen, so using chaos engineering to establish a culture of resiliency can save organizations time and money,” said Chris Aniszczyk, CTO of CNCF. “We are very excited to see how Chaos Mesh can grow as an incubating project and impact the state of the chaos and resiliency engineering space.”
Chaos Mesh has a full roadmap, and the team is actively adding new features and functionality while improving the overall chaos experience. The team is working to provide efficient status validation mechanisms and reporting capabilities to enhance ease of use and observability. Ongoing optimization of the Workflow engine will enable users to achieve a complete chaos engineering cycle with Chaos Mesh. The team also plans to support more fault types, covering as many real-life faults in cloud native systems as possible, to improve functionality and scalability. In addition, the team will provide a plug-in mechanism that lets users add their own fault types and publish their own plug-ins and chaos scenarios to Chaos Mesh.
As a CNCF-hosted project, Chaos Mesh is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. Chaos Mesh joins incubating technologies Argo, Buildpacks, CloudEvents, CNI, Contour, Cortex, CRI-O, Dragonfly, emissary-ingress, Falco, Flux, gRPC, KEDA, KubeEdge, NATS, Notary, OpenTelemetry, Operator Framework, Rook, SPIFFE, SPIRE, and Thanos. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.