Zendesk and the art of microservice maintenance
Challenge
Zendesk was built as a monolithic Rails app, using a MySQL database and running in a co-located data center on hardware the company owned. But as the company grew, “we realized that just throwing more and more stuff into a Rails monolith slowed down teams,” says Senior Principal Engineer Jon Moter. “Deploys were really painful and really risky.” The company decided to move to microservices and containers, but still needed to figure out a way to make adoption go smoothly.
Solution
After the team researched orchestration technologies in mid-2015, Moter says, “Kubernetes seemed like it was designed to solve pretty much exactly the problems we were having. Google knows a thing or two about containers, so it felt like, ‘All right, if we’re going to make a bet, let’s go with that one.’”
Impact
Today, about 70% of Zendesk applications are running on Kubernetes, and all new applications are built to run on it. Kubernetes has brought time savings, greater flexibility, and increased velocity to Zendesk. Previously, changing the resource profile of an application “was a big pain,” says Moter. “It could take a couple of days, in the best case, to get hardware changed.” In Kubernetes, that can take one or two minutes via a simple UI the team developed.
By the numbers
Resource profile changes
Take 1-2 minutes rather than days
Outage resolution
Happens with self-healing in minutes instead of hours spent patching things up
70% of applications now running on Kubernetes
Launched in 2007 with a mission of making customer service easy for organizations, Zendesk offers products involving real-time messaging, voice chat, and data analytics.
All of this was built as a monolithic Rails app, using a MySQL database and running in a co-located data center on hardware the company owned.
That system worked fine for the first seven years or so. But as Zendesk grew—the company went public in 2014 and now has 145,000 paid customer accounts and 3,000 employees—it became clear that changes were needed at the infrastructure level, and the effort to make those changes would lead the company to microservices, containers, and Kubernetes.
“We realized that just throwing more and more stuff into a Rails monolith slowed down teams,” says Senior Principal Engineer Jon Moter. “Deploys were really painful and really risky. Every team at Zendesk, some of whom were scattered in engineering offices all over the world, were all tied to this one application.”
Moving to microservices was a logical solution. At the time, though, there was still a centralized ops team, and “provisioning resources was really slow,” he says. “If you wanted to build and deploy a service, you often had to make a request for hardware a quarter in advance.” Additionally, a “huge amount” of Chef logic was used to provision servers, and “the staging environment wasn’t really the same as the production environment because the networking was different, and one was at AWS and one was at data centers,” says Moter. “A lot of this made things inconsistent.”
Some teams ended up reverting to putting their code in the monolith, or “we ended up with a couple of mini-monoliths that mapped approximately to which office people were in,” he adds.
In late 2014, senior technical engineers at Zendesk began working on a better solution. They quickly decided to adopt Docker containers, then embarked on a six-month deep dive into best practices around microservices and how they could be adopted at Zendesk.
Moter’s team built some tooling called ZDI (Zendesk Docker Integration), which got developers set up with containers almost instantly. But “we don’t want Docker images to just be a developer thing; we want to be able to run them in our staging and production environments, too,” he says. “We started off looking at creating a minimal agent that would run on nodes and run Docker containers based on something in a Consul key value store. But then we realized we were trying to build our own orchestrator, and that seemed like a bad idea.”
There were just a couple of options for orchestration at the time, in the summer of 2015, and after some research, Moter says, “Kubernetes seemed like it was designed to solve pretty much exactly the problems we were having. Google knows a thing or two about containers, so it felt like, ‘all right, if we’re going to make a bet, let’s go with that one.’”
Moter and a small team were given the green light to figure out how to make Kubernetes work at Zendesk. It took about a year of work to get to the point of having clusters running in production. (The company also migrated from the data center to about 15 clusters in AWS during this time.) At the beginning of 2017, the first application taking real customer traffic was deployed in Kubernetes. The first customers were enthusiastic about their results, and word of mouth increased adoption. Today, about 70% of Zendesk applications are running on Kubernetes, and all new applications are built to run on it.
To help developers, Moter’s team enhanced the open source deployment tool, Samson, that Zendesk uses. “Samson now connects directly to each of our Kubernetes clusters, reads in the YAML files in your GitHub repo and then does a little bit of transformation and magic to them,” Moter says.
If, for example, the number of replicas or the amount of CPU or RAM differs between the production and staging environments, those values can be adjusted in Samson’s UI, and Samson makes the modifications before sending the manifests off to the Kubernetes API.
Previously, changing the resource profile of an application “was a big pain,” says Moter. “It could take a couple of days, in the best case, to get hardware changed. In Kubernetes, people can go into Samson, scale up number of replicas, tweak the CPU and RAM, hit redeploy and literally a minute or two later, they’re running with a different resource profile.”
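The per-environment adjustment described above can be illustrated with a minimal sketch. This is not Samson's code (Samson is a separate open source tool with its own implementation); the function name `apply_overrides`, the `ENVIRONMENT_OVERRIDES` table, and the example manifest are all hypothetical, chosen only to show the idea of taking one base Deployment manifest and changing replicas, CPU, and memory per environment before it is submitted.

```python
import copy


def apply_overrides(manifest, overrides):
    """Return a copy of a Deployment manifest with overrides applied.

    `overrides` is a hypothetical dict that may contain "replicas",
    "cpu", and "memory". The input manifest is never mutated.
    """
    result = copy.deepcopy(manifest)
    spec = result["spec"]
    if "replicas" in overrides:
        spec["replicas"] = overrides["replicas"]
    for container in spec["template"]["spec"]["containers"]:
        # Create the resources/requests sections if the manifest omits them.
        requests = container.setdefault("resources", {}).setdefault("requests", {})
        for key in ("cpu", "memory"):
            if key in overrides:
                requests[key] = overrides[key]
    return result


# Illustrative base manifest, as it might be parsed from a YAML file in a repo.
base = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "example-app"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "example-app"}},
        "template": {
            "metadata": {"labels": {"app": "example-app"}},
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "example/web:1.0",
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "256Mi"}
                        },
                    }
                ]
            },
        },
    },
}

# Assumed per-environment settings, standing in for values chosen in a UI.
ENVIRONMENT_OVERRIDES = {
    "staging": {"replicas": 1},
    "production": {"replicas": 6, "cpu": "500m", "memory": "512Mi"},
}

staging = apply_overrides(base, ENVIRONMENT_OVERRIDES["staging"])
production = apply_overrides(base, ENVIRONMENT_OVERRIDES["production"])
```

Keeping a single base manifest and layering small overrides on top means the two environments differ only in the values that are meant to differ, which is one way to avoid the staging and production drift Moter describes.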
Kubernetes has brought other time-saving benefits to Zendesk.
Recently, a bug in AWS caused a lot of EC2 instances in one availability zone of one region to suddenly get terminated. Zendesk runs both Kubernetes and non-Kubernetes workloads on EC2. “Kubernetes basically self-corrected within a couple of minutes, without anyone needing to do anything,” says Moter. “Whereas with a lot of the other applications running outside of Kubernetes, teams had to manually patch some things up.”
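The self-correction Moter describes follows from the way an orchestrator continually compares the desired number of replicas with what is actually running. The toy model below sketches that idea only; it is not Kubernetes controller code, and the function names, pod and node names, and the round-robin placement are assumptions made for illustration.

```python
def reconcile(desired_replicas, running_pods, live_nodes):
    """Return (surviving_pods, missing_count) for one reconciliation pass.

    A pod survives only if the node it ran on is still alive.
    """
    surviving = [pod for pod in running_pods if pod["node"] in live_nodes]
    missing = max(0, desired_replicas - len(surviving))
    return surviving, missing


def schedule_replacements(surviving, missing, live_nodes, name_prefix):
    """Place replacement pods on live nodes in round-robin order."""
    nodes = sorted(live_nodes)
    pods = list(surviving)
    if not nodes:
        # Nothing to schedule onto; replacements wait for capacity.
        return pods
    for i in range(missing):
        pods.append(
            {"name": f"{name_prefix}-replacement-{i}", "node": nodes[i % len(nodes)]}
        )
    return pods


# Hypothetical state before the outage: three pods, two of them in zone "a".
pods_before = [
    {"name": "web-0", "node": "zone-a-node-1"},
    {"name": "web-1", "node": "zone-a-node-2"},
    {"name": "web-2", "node": "zone-b-node-1"},
]

# Zone "a" instances are terminated; only zone "b" and "c" nodes remain.
live_nodes = {"zone-b-node-1", "zone-c-node-1"}

surviving, missing = reconcile(3, pods_before, live_nodes)
pods_after = schedule_replacements(surviving, missing, live_nodes, "web")
```

Because the loop works from declared state rather than from a record of what failed, no one has to diagnose which instances disappeared before replacements are started, which is the contrast with the manually patched workloads running outside Kubernetes.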
These improvements “allow teams way more flexibility,” says Moter. “Making it easier for teams to develop and deploy microservices means that teams can choose to deploy as frequently or as infrequently as they want. The microservices are easier to reason about, to verify that all their tests pass, quicker to spin up and spin down, so teams are able to achieve a lot better velocity with that.”
Having a common orchestration platform also makes it way easier to have common tooling, common monitoring, and more predictable dashboards, Moter adds. “That has helped make it easier to onboard people and follow standard sorts of templates, and to set up monitors and alerting in a fairly consistent manner. And it helps a lot with on-call. We have offices scattered around the world, so for people on-call, it’s daytime at one of our offices all day.”
With the cloud native ecosystem getting more complicated, Moter’s team is starting to work on an internal Platform as a Service, a simplified interface that could work with 80% of the use cases for most teams. “Expecting every dev team to have a full understanding of Istio and Kubernetes and networking and… is asking an awful lot,” he says. (If you’re interested in working on such things, he adds, they’re hiring.)
For other organizations looking to adopt Kubernetes, Moter has this advice: “Kubernetes is amazing, but it’s a complicated system. You need to have a set of people in your organization who have a really thorough understanding of Kubernetes. And then, probably a broad swath of your organization will have maybe at least a little knowledge of Kubernetes. You can’t outsource all of that knowledge and just assume everything’s going to work. Expect a learning curve.”
By the same token, Moter adds, “Get your feet wet with a couple of small applications, but then target one of your biggest ones. It’s kind of scary, but I think if you can prove that it works for one of your more significant applications, then that gives your organization a lot of confidence. You can’t wait too late in the game to figure out how this is going to work for whatever your biggest, gnarliest problems are.”
And when you’re in the thick of that, and you get stuck, there’s always the community. “Having so many companies that either compete with each other, or are in different industries, all collaborating, sharing best practices, working on stuff together,” says Moter, “I think it’s really inspiring in a lot of ways.”