How cloud native is driving Zerodha, the world’s largest retail stock investment platform
Challenge
The Indian stock brokerage Zerodha handles 8 million trades a day, making it the largest retail stock investment platform in the world. With a complex, hybrid infrastructure, a heavily regulated technology stack, and end-user applications and internal systems with disparate external dependencies, the company needed “a centralized, uniform monitoring infrastructure that worked across a wide variety of complex environments,” says CTO Kailash Nadh.
Solution
Zerodha adopted Prometheus for monitoring, then migrated many of its services from VMs to containers, and then to Kubernetes.
Impact
“Prometheus gave us powerful monitoring for critical, low-latency financial systems,” says Nadh. With Kubernetes, “we’ve gained scale and modularity, especially in an environment where sweeping regulatory changes often demand significant changes to systems. In addition, we are able to utilize cloud resources more efficiently.”
By the numbers
Cost of compute instances reduced by at least 50%
1 billion retail trades processed in 2019
Complete environments with all the dependencies can be brought up in minutes instead of hours
The Indian stock brokerage Zerodha handles 8 million trades a day, making it the largest retail stock investment platform in the world.
“Our mission is to make trading and investing easy and accessible to the masses,” says CTO Kailash Nadh.
Given its industry and scale, Zerodha requires infrastructure that spans a public cloud (AWS for most in-house applications) and physical machines in multiple data centers, with specific regulatory and technical requirements for capital market connectivity via leased lines and adapters from various stock exchanges.
That complexity, along with a heavily regulated technology stack and end-user applications and internal systems with disparate external dependencies, led the company to embrace cloud native technologies.
“We needed a centralized, uniform monitoring infrastructure that worked across a wide variety of environments,” Nadh says. “Prometheus gave us powerful monitoring for critical, low-latency financial systems. It helped us aggregate and monitor metrics infra-wide. The large number of existing exporters and the ease of writing custom exporters enabled us to attain wide coverage in a short period of time.”
Additionally, Zerodha began moving its services from VMs to containers, and then gradually to Kubernetes in 2020. Because all of its apps had already been developed with a service-oriented architecture and a 12-factor approach, the migration was straightforward. The infrastructure team began by creating CI pipelines with GitLab, building on the company's well-defined CI/CD process for pushing changes to production. With a focus on infrastructure-as-code practices, Zerodha uses a mix of Terraform, Packer, and eksctl to create its Kubernetes infrastructure on AWS, and hosts container artifacts on an internal registry powered by Amazon Elastic Container Registry (ECR).
“We have been conscious of not creating an ops vs. dev divide,” says Nadh. “Developers are responsible for the entire lifecycle of their projects, including deployments. We created a standard template for deployments with Kubernetes that allows developers to craft their own deployments with minimal scaffolding or direct involvement of DevOps engineers.”
As a result, deployment rollouts are faster and more frequent: Complete environments with all the dependencies can be brought up in minutes, rather than hours, with very little manual intervention. “Kubernetes has helped us standardize the deployment process of applications built on many different kinds of stacks across teams,” he says. “We’ve gained scale and modularity, especially in an environment where sweeping regulatory changes often demand significant changes to systems. Kubernetes also allows us fine-grained resource allocation for workloads, reducing cost of compute instances by at least 50%.”
Zerodha is using a CNCF incubating project, NATS, “for transmitting large volumes of real-time market data across infrastructures across applications at high throughputs,” says Nadh. “Many of our components depend on the ease of subscriptions and instant and ‘automagical’ failovers NATS offers. We had at least three other technologies that we had tried over the years before stumbling upon NATS. It pretty much solved all the issues we had faced with other message streams and PubSub systems.”
Other CNCF projects are in the pipeline at Zerodha, such as Envoy as an edge proxy for inter-application networking. “We use Kong as an API Gateway internally. However, we are looking for a service mesh that can gracefully handle retries and failures, load balancing, cross-infra discovery, etc. Envoy seems to fit the bill,” he says. “We are looking at Istio as the control plane, and Jaeger, which comes bundled with it, for enabling tracing across services.”
For Zerodha, the impact of cloud native has been clear: “Thanks to our lean, modern, and scalable infrastructure and architecture, we have been able to focus on high quality technology and products that have catapulted us to the #1 position in the industry, trumping large stock brokers operated by megabanks,” Nadh says. “We have been able to achieve unprecedented scale with our stack, handling a million concurrent users who receive tens of millions of streaming market quotes per second, and place millions of low-latency trades. In 2019, we processed a billion retail trades, possibly the highest for any stock broker in the world. Our daily volumes amount to about 20% of all retail stock market volumes in India.”
In the near future, the Zerodha team plans to focus on migrating most stateless workloads to Kubernetes, implementing a service mesh, and adopting more cloud native observability tools. The company has seen the benefits of cloud native and is continuing on the journey.
“Cloud native enables companies to do faster iterations and deployments and have well defined moving parts that are easily managed,” says Nadh. “Running applications in a cloud native stack also helps companies save on infrastructure costs as they are able to better utilize the cloud resources and scale up/down dynamically. Cloud native technologies have proven to be a great way of running, managing, and monitoring complex application stacks.”