How T-Mobile is leveraging Kubernetes to handle iPhone-launch scale
Challenge
The third-largest wireless carrier in the U.S., T-Mobile found itself not agile enough to keep up with its business goals: In 2015, it took seven months to get new code to production. The following year, the company adopted Pivotal Cloud Foundry. While the migration of the most critical applications yielded great results—the time to production shrank to less than a day—not all applications, particularly those in Docker containers, ran smoothly on the PaaS. “We needed a container orchestrator that met a certain set of requirements both not just in terms of the value provided to the application teams, but also our ability to manage it,” says James Webb, Member of Technical Staff. The main requirements: high availability at every level, persistent storage, and the ability to patch and upgrade the infrastructure seamlessly without any impact to customers.
Solution
The team spent six months working with an outside company to build a completely open source Kubernetes platform for T-Mobile, but when Pivotal rolled out PKS, they decided to switch over, with the goal of being ready for the peak retail season starting in late Q3 2018. During the new iPhone launch last September, a small amount of production traffic was running on Kubernetes. This September, says Brendan Aye, Director, Platform Architecture, “we’ll have a huge portion of apps, especially in the sales path for iPhone, running on Kubernetes.”
Impact
T-Mobile is still early in its adoption phase for Kubernetes, but “teams will go from five or six days of waiting time, to five or six seconds of copying the yaml and pasting it and deploying something, or clicking the button and getting that immediately,” says Aye. Enabling that across the enterprise will have a great impact on speed of delivery. “As they’re able to control these services a lot closer to their own team, they’ll have more ability to leverage capabilities to run their applications in an active manner across all the different hardware regions we provide, whether it’s on-premise in one of our two data centers, or in multiple cloud providers,” says Aye.
By the numbers
Getting a new database: went from 5 days to 5 seconds
95% of deployments done in daytime with zero impact
Team of 25 supporting 700 developers
In 2015, it took T-Mobile seven months to get new code to production. That certainly wasn’t the speed of delivery that the third-largest wireless carrier in the U.S. needed to keep up with its business goals, so the following year, the company adopted Pivotal’s Platform as a Service offering, Pivotal Cloud Foundry.
“It solved a lot of problems for us,” says Brendan Aye, Director, Platform Architecture, “because we could have very consistent experiences getting code from Dev to Test to Prod.”
Aye and James Webb, Member of Technical Staff, spent most of 2017 decomposing T-Mobile’s most critical applications into microservices and migrating them to Cloud Foundry. The migration yielded great results: The seven-month time to production shrank to less than a day. By the end of the year, 250 million calls per day were going through Cloud Foundry, primarily via homegrown apps in the stateless middleware layer.
But not all applications, particularly vendor-delivered ones that were shipped to T-Mobile in Docker containers, ran smoothly on PaaS during updates. “So we started to look at what kind of options we had to run more ad hoc containers,” says Webb. “We needed a container orchestrator that met a certain set of requirements both not just in terms of the value provided to the application teams, but also our ability to manage it.” The main requirements: high availability at every level, persistent storage, and the ability to patch and upgrade the infrastructure seamlessly without any impact to customers.
“Kubernetes checked a lot of those boxes,” says Webb, and by that point, “it had become the dominant force.” Aye and Webb’s team spent six months working with an outside company to build a completely open source Kubernetes platform for T-Mobile, but when Pivotal rolled out PKS, they decided to switch over. “We deploy Cloud Foundry in a very specific way, and if we could do the same thing with Kubernetes, that gives us a lot of efficiencies in terms of how we operate, the automation we build, the monitoring we do,” says Aye. “It was win-win-win.” Their goal was to be ready for the peak retail season starting in late Q3 2018. During the new iPhone launch last September, a small amount of production traffic was running on Kubernetes. “We’re looking to hockey stick in 2019 in terms of the applications put on board,” says Webb. This September, Aye says, “we’ll have a huge portion of apps, especially in the sales path for iPhone, running on Kubernetes.”
Indeed, the mandate to move to containers and cloud native is coming from a very high level at T-Mobile, and “we’re using Kubernetes to bring more container goodness to the enterprise,” says Webb. In the coming months, the team expects various mission-critical applications to be migrated to the Kubernetes platform, including retail store apps like mobile point of sale that enable sales associates at T-Mobile stores to ring up your purchases on their iPads. But there are many other types of applications that could potentially end up on the platform, so “we’re trying to build a system that we can optimize, make it good for everything generic, and then as we start getting more individual use cases, that’ll help us focus on problems that we find as we go,” says Webb.
With the promise that Kubernetes brings, people are lined up to onboard. But there’s a learning curve, and Webb and Aye’s team is evolving from platform engineers to customer success engineers, with a strategy of “building an internal center of excellence for how to use Kubernetes, how to do cloud native development, and then make sure those folks share that knowledge with others in your company,” Webb says. That means working with Pivotal to develop some internal training, but the philosophy is that not everyone needs to be a Kubernetes expert. “Your value comes from delivering features for your application, not from learning Kubernetes and deploying containers,” says Aye. “If you don’t have to manage these things, you shouldn’t try to.”
T-Mobile is still early in its adoption phase for Kubernetes, but “we’ve seen the impact of running large-scale platforms that manage containers and removing the inefficiencies from our development process,” says Webb. “We expect Kubernetes to have the same impact; it’s just a matter of time.”
Moving to containers and Cloud Foundry already reduced the time for production deployments from seven months to the same day, and 95% of deployments are now done in the daytime with zero impact. With Kubernetes, Aye has additional expectations. “We’ll see improvements around the ability to deliver not just the internally developed services, but also other services to support the business much more quickly,” he says. “Instead of saying we’re a team that now gives you a database, it’s going to be: We’re a team that provides an operator that allows you to give yourself a database. So these teams will go from five or six days of waiting time, to five or six seconds of copying the yaml and pasting it and deploying something, or clicking the button and getting that immediately.”
Enabling that across the enterprise will have a great impact on speed of delivery. “As they’re able to control these services a lot closer to their own team, they’ll have more ability to leverage capabilities to run their applications in an active manner across all the different hardware regions we provide, whether it’s on-premise in one of our two data centers, or in multiple cloud providers,” says Aye.
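The self-service flow Aye describes, where a team copies a short manifest and deploys it instead of filing a request, might look roughly like the sketch below. It is an illustration only: the source does not describe T-Mobile's operators or manifests, so the `Database` kind, the `databases.example.com` API group, the spec fields, and the `database_claim` helper are all hypothetical.

```python
import json


def database_claim(name: str, team: str, storage_gb: int = 10) -> dict:
    """Build a manifest for a hypothetical `Database` custom resource.

    An operator watching this kind would be responsible for provisioning
    the actual database; the field names here are assumptions.
    """
    if storage_gb <= 0:
        raise ValueError("storage_gb must be positive")
    return {
        # Hypothetical API group and kind; a real operator defines its own.
        "apiVersion": "databases.example.com/v1alpha1",
        "kind": "Database",
        "metadata": {"name": name, "labels": {"team": team}},
        "spec": {"engine": "postgres", "storage": f"{storage_gb}Gi"},
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output could be piped
    # to `kubectl apply -f -` on a cluster where such an operator exists.
    print(json.dumps(database_claim("orders-db", "retail-pos"), indent=2))
```

The point of the pattern is that the platform team owns the operator while application teams own only a small declarative request like this one.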
In the meantime, there’s already been a positive impact on the culture at T-Mobile. At the beginning of this initiative, Webb and Aye were on their own, with just a couple of people added during the migration process in 2017. Since then, they’ve added more than 20 people to the platform engineering team. “It was hard for us to find people that had Kubernetes or Cloud Foundry experience that were not already employed,” says Aye. “So as much as possible, we’ve tried to upskill people in the rest of the org who were willing to learn. Even though they were spending time on the Unix operations team, or maybe the Puppet build team, we brought them into our team. We found people who were really passionate problem solvers, were eager to learn, and had an obsessive customer focus.”
Adds Webb: “It’s a testament to how important the company feels the products are to their success in future, that we have that flexibility.”
Just as importantly, containerization and Kubernetes have changed the mindset within the company. “If something goes wrong, you know you can deploy a new version, and if it breaks spectacularly, you can roll back to the old version within minutes or seconds,” says Aye. “So you can take on more risk because of this, and try new things out. From leadership all the way down to the individual contributors, we’ve seen they’re much more willing to take on this risk.”
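The quick rollback Aye mentions can be illustrated with the standard `kubectl rollout undo` command. The sketch below is not T-Mobile's tooling: the `rollback_command` and `roll_back` helpers and the `checkout` deployment name are hypothetical, and running it for real assumes `kubectl` is installed and pointed at a cluster.

```python
import subprocess
from typing import List, Optional


def rollback_command(deployment: str, namespace: str = "default",
                     to_revision: Optional[int] = None) -> List[str]:
    """Build the kubectl argv that reverts a Deployment to an earlier revision."""
    cmd = ["kubectl", "rollout", "undo", f"deployment/{deployment}",
           "--namespace", namespace]
    if to_revision is not None:
        if to_revision < 1:
            raise ValueError("to_revision must be 1 or greater")
        # Without this flag, kubectl reverts to the previous revision.
        cmd.append(f"--to-revision={to_revision}")
    return cmd


def roll_back(deployment: str, namespace: str = "default",
              to_revision: Optional[int] = None) -> None:
    """Run the rollback; raises CalledProcessError if kubectl fails."""
    subprocess.run(rollback_command(deployment, namespace, to_revision),
                   check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(rollback_command("checkout", namespace="retail")))
```

Because the previous revision is retained by the Deployment, reverting is a single command, which is part of why a failed release is cheap to undo.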