How Slack leverages Vitess to keep pace with ever-growing storage needs
Challenge
In the fall of 2016, Slack was dealing with hundreds of thousands of MySQL queries per second and thousands of sharded MySQL hosts in production. The company’s busiest database hosts had to handle all the queries from larger and larger customers, so they were running hotter and hotter while there were thousands of other hosts that were mostly idle. And the burden of manually managing the fleet of hosts in the legacy configuration was taxing the operations teams. The Slack engineering team knew it needed a new approach to scaling and managing databases for the future.
Solution
In looking for a scalable storage solution, the team had certain requirements. They didn’t want to move away from MySQL, and they wanted to continue to host Slack’s own instances running in AWS, on a platform that incorporated new technology now but could still be extended later. “Vitess was really the only approach that fit all our needs,” says Principal Engineer Michael Demmer.
Impact
“Vitess has been a clear success for Slack,” says Demmer. “The project has both been more complicated and harder to do than anybody could have forecast, but at the same time Vitess has performed in its promised role a lot better than people had hoped for.” Critical application features like @-mentions, stars, reactions, user sessions, and others are now fully backed by Vitess, “powered in ways that are much more sustainable and capable of growth,” he says. Performance-wise, Vitess adds only about a millisecond of connection latency on average, which is not discernible to users. “Our goal is that all MySQL at Slack is run behind Vitess,” says Demmer. “There’s no other bet we’re making in terms of storage in the foreseeable future.”
By the numbers
Query rate
~500,000 queries per second at peak times
Query load
~20 billion total queries per day
Added connection latency with Vitess
~1 millisecond on average
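As a rough sanity check on these figures, the daily total can be converted to an average per-second rate and compared with the stated peak. This is a back-of-the-envelope sketch that assumes both numbers describe the same overall query load; the constant and variable names are illustrative only.

```python
# Back-of-the-envelope check on the published figures (both are approximate).
QUERIES_PER_DAY = 20_000_000_000   # ~20 billion total queries per day
PEAK_QPS = 500_000                 # ~500,000 queries per second at peak
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate implied by the daily total, and how far the peak sits above it.
average_qps = QUERIES_PER_DAY / SECONDS_PER_DAY
peak_to_average = PEAK_QPS / average_qps

print(f"average: ~{average_qps:,.0f} queries per second")
print(f"peak is ~{peak_to_average:.2f}x the average")
```

Under that assumption the average works out to roughly 231,000 queries per second, so the stated peak is a little over twice the average.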
Since its launch in 2013, Slack has grown into an essential collaboration tool for more than 10 million weekly active users all over the world, at organizations ranging from startups to some of the biggest Fortune 500 companies.
By the fall of 2016, that was translating into hundreds of thousands of MySQL queries per second and thousands of sharded MySQL database hosts in production.
But the biggest challenge wasn’t just the number of queries per second; it was that people tend to stay connected to Slack for long stretches, with periodic spikes in activity. With millions of users connected simultaneously, the company found that “a lot of our scaling challenges have to do with managing the flow of connections coming and going, and in the workloads that occur when a lot of people on a team or a workspace do a lot of coordinated efforts at once,” says Principal Engineer Michael Demmer. “With our legacy architecture, all the load from a given customer was concentrated on a single database cluster. Since Amazon only has a certain maximum instance size that we can buy, when a customer outgrows the capabilities of one host, we don’t really have anywhere to go.”
The Slack engineering team knew it needed a new approach to scaling and managing databases for the future. “Slack’s approach to data sharding had been very centered around the concept of a workspace, which had served the company well as it scaled to more and more customers,” says Demmer. “However, this approach was no longer the best fit as our product needs evolved to include more cross-workspace interactions and the size of individual customers got larger and larger.”
For instance, a few years ago, Slack released an enterprise product that lets a customer’s workload be divided across several connected Slack workspaces, and it more recently released shared channels, which allow customers to share messages and files in a channel with other customers’ workspaces. With these product changes, Slack user interactions no longer mapped cleanly to a single database shard, so sharding by workspace wasn’t a sustainable strategy.
At the same time, the company’s busiest database hosts had to handle all the queries from larger and larger customers, so they were running hotter and hotter, yet there were thousands of other hosts that were mostly idle. And the burden of manually managing the fleet of hosts in the legacy configuration was taxing the operations teams.
In looking for a scalable storage solution, the team had certain requirements. “We have thousands and thousands of lines of code that are written in specific SQL queries, some of which expect MySQL-specific semantics,” says Demmer. “So we really did not want to move away from MySQL at the heart of our database systems.” The company also wanted to continue to host its own instances running in AWS, on a platform that incorporated new technology now but could still be extended later. Given the complexities of the features Slack was developing, Demmer believed the technology would need to offer a fine-grained, flexible “shard by anything” solution, one that could also help automate shard splitting as any host approaches its maximum storage size.
And, Demmer says, “no database storage system other than Vitess truly fit all of Slack’s needs.”
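To make the “shard by anything” idea concrete, here is a minimal Python sketch that hashes an arbitrary key into a fixed keyspace and assigns contiguous ranges of that keyspace to shards. It is an illustration only, not Vitess’s implementation: the function names (`keyspace_id`, `shard_for`), the choice of SHA-256, the 64-bit keyspace, and the example keys are all assumptions made for this sketch.

```python
import hashlib

KEYSPACE_BITS = 64  # assumed size of the hashed keyspace for this sketch


def keyspace_id(sharding_key: str) -> int:
    """Hash any sharding key (workspace, channel, user, ...) to a 64-bit integer."""
    digest = hashlib.sha256(sharding_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(sharding_key: str, num_shards: int) -> int:
    """Return the index of the equal-sized keyspace range that holds the key."""
    return (keyspace_id(sharding_key) * num_shards) >> KEYSPACE_BITS


# Hypothetical keys: the same function works whatever a table is sharded by.
for key in ("workspace:T123", "channel:C456", "user:U789"):
    print(key, "->", shard_for(key, num_shards=4))
```

Because each shard owns a contiguous range in this sketch, doubling `num_shards` splits every range in two: a key on shard `i` of 4 lands on shard `2*i` or `2*i + 1` of 8, so a split only moves data from a parent shard to its two children.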
The team took confidence from Vitess’s success at YouTube and, after getting familiar with the code, decided to build upon the technology to make it work at Slack. “It was clear that Vitess was not an out-of-the-box, perfect thing that would just be installed and would work to solve all of our needs,” says Demmer. “The things that Vitess does, it does well. However, like any other complex piece of technology, there is going to be some amount of work for your own application to figure out how to use it best, or in some cases, to adapt the technology to your needs.”
The team spent considerable time working on the query-planning and execution engine, fault isolation features, and support for Prometheus metric exports. They also wrote a query simulation engine, vtexplain, to help Slack developers understand how queries will perform in Vitess. All of this work has been upstreamed to the open source project.
All in all, it took about seven months to go from whiteboard to production. Currently, about 45% of Slack’s overall query load—about 20 billion total queries per day, with some 500,000 queries per second at peak times—is running on Vitess. And all new features are written to use Vitess only. “Vitess has been a clear success for Slack,” says Demmer. “The project has both been more complicated and harder to do than anybody could have forecast, but at the same time Vitess has performed in its promised role a lot better than people had hoped for.”
Part of the difficulty had to do with the team’s decision to change three things at once: how the data is stored, what format to use for each row in the code, and the migration from the legacy system to Vitess. “That meant that we accomplished three goals with one migration, but it also meant that we ended up with three times the complexity with that migration,” says Demmer.
But the results show the effort was well worth it. Critical application features like @-mentions, stars, reactions, user sessions, and others are now fully backed by Vitess, “powered in ways that are much more sustainable and capable of growth,” says Demmer. “There are absolutely features that we have written only because we had the capability that Vitess enabled.” Performance-wise, Vitess adds only about a millisecond of connection latency on average, which is not discernible to users.
A team of around eight full-time engineers is actively working on migrating more applications to Vitess and improving Slack’s database operations automation. Hundreds more developers are interacting with Vitess on a daily basis to build product features. “Our goal is that all MySQL at Slack is run behind Vitess,” says Demmer. “There’s no other bet we’re making in terms of storage in the foreseeable future.”
For other organizations looking at Vitess, Demmer has this piece of advice: “Go in knowing this is a relatively new system that can bring huge benefits, but will for sure have interesting things that you will discover, so make sure that you understand what that process will be like before going down that road.”
The team at Slack is continuing to pave the way. “The community is really strong, and I think the work that’s being done on the applicability and ease of use, and simplifying all of the components, will make adoption a little easier for people,” Demmer says. “Vitess is a great project.”