If you live on Slack, multiply your usage by millions of active users worldwide and you’ll quickly understand why the company ran into data storage challenges.
In the fall of 2016, Slack was dealing with hundreds of thousands of MySQL queries per second and thousands of sharded MySQL hosts in production. The company’s busiest database hosts had to handle all the queries from larger and larger customers, so they were running hotter and hotter while there were thousands of other hosts that were mostly idle. And the burden of manually managing the fleet of hosts in the legacy configuration was taxing the operations teams.
The Slack engineering team knew it needed a new approach to scaling and managing databases for the future. The challenges were technological, architectural, and process-related, and they needed to be solved while the system was running at full scale.
“We have thousands of lines of code containing SQL queries, some of which expect MySQL-specific semantics,” says Principal Engineer Michael Demmer. “That plus our years of operational experience and tooling meant that we really did not want to move away from MySQL at the heart of our database systems.” The company also wanted to continue to host its own instances running in AWS, on a platform that incorporated new technology now, but could still be extensible later. “Vitess was really the only approach that fit all our needs,” says Demmer.
The team had confidence given Vitess’s success at YouTube, and after getting familiar with the code, decided to build upon the technology in order to make it work at Slack. (Everything has been upstreamed in the open source project.) It took about seven months to go from whiteboard to production. Currently, around 45% of Slack’s overall query load (about 20 billion total queries per day, with some 500,000 queries per second at peak times) is running on Vitess. And all new features are written to use Vitess only.
“Vitess has been a clear success for Slack,” says Demmer. “The project has both been more complicated and harder to do than anybody could have forecast, but at the same time Vitess has performed in its promised role a lot better than people had hoped for.” Critical application features like @-mentions, stars, reactions, user sessions, and others are now fully backed by Vitess, “powered in ways that are much more sustainable and capable of growth,” he says. Performance-wise, the connection latency with Vitess is an extra millisecond on average, not discernible to users. “Our goal is that all MySQL at Slack is run behind Vitess,” says Demmer. “There’s no other bet we’re making in terms of storage in the foreseeable future.”
Read more about Slack’s use of Vitess in the full case study.