Case Study

Uber

How Uber is monitoring 4,000 microservices with M3, its open sourced platform for Prometheus metrics

Challenge

With 4,000 proprietary microservices and a growing number of open source systems that needed to be monitored, by late 2014 Uber was outgrowing its usage of Graphite and Nagios for metrics. “A lot of teams were using pre-packaged Graphite monitoring software and trying to write scripts in Nagios to check the metrics that were being collected from these packages, and it was just very hard to maintain at scale,” says Rob Skillington, Technical Lead for Metrics and Systems Monitoring.

Solution

“We ultimately chose Prometheus because the client libraries and features were what engineers at Uber wanted to work with,” says Skillington. His team also built and open sourced the M3 platform, a turnkey, scalable, and configurable store for Prometheus metrics. M3 now houses over 6.6 billion time series at Uber, aggregates 500 million metrics per second, and persists 20 million resulting metrics per second to storage globally.

Impact

With its use of Prometheus and M3, Uber’s storage costs for ingesting metrics became 8.53x more cost effective per metric per replica. The team estimates that setting up monitoring systems in Uber data centers for its Advanced Technologies Group was 4x faster than it would have been under the previous process. Plus, the team is now 16.67x less burdened by operational maintenance: The number of combined high/low urgency notifications per week went from 25 for Cassandra to 1.5 for M3DB.

Published: February 5, 2019

By the numbers

Cost savings

Storage is now 8.53x more cost effective per metric per replica

Time savings

4x faster to set up monitoring systems in data centers

6.6 billion time series stored, 500 million metrics per second aggregated

In just seven years of existence, Uber has become an everyday convenience in more than 700 cities around the world.

To help manage its exponential growth and resulting scale—the mobile app has connected riders and drivers more than a billion times—the company began breaking down its monolith into microservices.

But what started out as a few dozen soon became 4,000 proprietary backend microservices that needed to be instrumented for monitoring, alerting, and anomaly detection. On top of that, Uber wanted observability into the systems the services operate on, such as Ubuntu, as well as open source software like MySQL, Cassandra, Redis, Etcd, ZooKeeper, and Kafka, which were all running on a combination of the company’s on-premises data centers, AWS, and GCP. In the face of this complexity, “We were kind of building our own systems and components for monitoring” using Graphite and Nagios, says Rob Skillington, Technical Lead for Metrics and Systems Monitoring.

By late 2014, it had become clear that Uber had outgrown this DIY setup. “A lot of teams were using pre-packaged Graphite monitoring software and trying to write scripts in Nagios to check the metrics that were being collected from these packages, and it was just very hard to maintain at scale,” says Skillington. “It also came down to the volume of metrics that were being generated by all these extra services, and the fact that Graphite wasn’t able to scale in terms of the replication and management of the stack. It was far less dynamic and required a lot of manual operation and downtime during any changes that we needed to make.”

Skillington’s team evaluated several technologies, including Atlas and OpenTSDB, but the fact that a growing number of open source systems were adding native support for the Prometheus Metrics Exporter format tipped the scales in that direction. “We ultimately chose Prometheus because the client libraries and features were what engineers at Uber wanted to work with,” says Skillington. “It was clear that using standard Prometheus exporters was far better than writing and maintaining our own. And in general, we liked the ecosystem and supporting infrastructure that was being created by the community.”

Plus, he adds, “it was important that the project was hosted at CNCF, as that meant we were confident there would be a strong community around it for some time. The open governance and wide industry participation helped us feel at ease that Prometheus would be compatible with almost any popular open source software that we would need to monitor now and in the future.”
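To make the client-library point concrete, here is a minimal sketch, in Go, of the kind of service-side instrumentation the Prometheus client library (github.com/prometheus/client_golang) enables. The metric name, label, handler, and port are illustrative assumptions, not taken from Uber’s services.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ridesRequested is a hypothetical counter; a real service would define
// metrics that match its own domain.
var ridesRequested = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "rides_requested_total",
		Help: "Total number of ride requests handled.",
	},
	[]string{"city"},
)

func main() {
	// Register the metric with the default Prometheus registry.
	prometheus.MustRegister(ridesRequested)

	// A toy handler that increments the counter on each request.
	http.HandleFunc("/request", func(w http.ResponseWriter, r *http.Request) {
		ridesRequested.WithLabelValues(r.URL.Query().Get("city")).Inc()
		w.WriteHeader(http.StatusAccepted)
	})

	// Expose metrics in the Prometheus exposition format so that any
	// Prometheus-compatible scraper can collect them.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A Prometheus server, or any compatible scraper, can then pull the /metrics endpoint on its own schedule, which is the same pull model the standard exporters mentioned above rely on.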

With that decision made, the team looked for an open source alternative to the company’s existing metrics platform. Finding none that could run as a self-service platform or would meet the company’s goals for resource efficiency or scale, the team built and open sourced the M3 platform, a turnkey, scalable, and configurable store for Prometheus metrics. “At first M3 leveraged almost entirely open source components for essential roles such as statsite for aggregation, Cassandra with Date Tiered Compaction Strategy for time series storage, and Elasticsearch for indexing,” says Skillington. “Due to operational burden, cost efficiency, and a growing feature set, we gradually outgrew each one.” Over time, Uber developed replacement components: M3DB, M3 Query, M3 Coordinator, and M3 Aggregator, which are all open sourced as part of M3.
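For a sense of how such a store is consumed, the sketch below issues an instant PromQL query against a Prometheus-compatible HTTP API using the Prometheus Go client’s API package. The endpoint address and the query expression are assumptions for illustration; whether a particular M3 Query or M3 Coordinator deployment serves this API at that address depends on its configuration, so treat this as a sketch rather than M3’s documented interface.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical address of a Prometheus-compatible query endpoint.
	client, err := api.NewClient(api.Config{
		Address: "http://localhost:7201",
	})
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}

	promAPI := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Evaluate an instant PromQL query; the expression is illustrative.
	result, warnings, err := promAPI.Query(ctx, `sum(rate(rides_requested_total[5m]))`, time.Now())
	if err != nil {
		log.Fatalf("query failed: %v", err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}
	fmt.Println(result)
}
```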

“Prometheus added a ton of very high quality libraries and common monitoring metrics exporters, and the way in which it exported its metrics made it very easy for us to continue to pull in existing software and use that at scale.”

— ROB SKILLINGTON, TECHNICAL LEAD FOR METRICS AND SYSTEMS MONITORING AT UBER

Uber’s M3 platform now houses over 6.6 billion time series, aggregates 500 million metrics per second, and persists 20 million resulting metrics per second to storage globally.

With its use of Prometheus and M3, Uber’s storage costs for ingesting metrics became 8.53x more cost effective per metric per replica. The team estimates that setting up monitoring systems in Uber data centers for its Advanced Technologies Group was four times faster than it would have been under the previous process. “For systems that had native Prometheus support for their metrics, it took almost zero time for us to onboard, versus some fixed amount of time that it would require for us to go in and instrument by hand ourselves,” says Skillington. Plus, the team is now 16.67x less burdened by operational maintenance: The number of combined high/low urgency notifications per week went from 25 for Cassandra to 1.5 for M3DB.

Given these results, Skillington’s team is working on speeding up adoption of Prometheus and M3 at Uber. Already, all metrics are stored in M3, and most of the open source software running either on premises or in the cloud is monitored with Prometheus Metrics Exporters. Up to 10% of Uber’s proprietary services are using Prometheus metrics client libraries. Skillington is hoping to see Prometheus and OpenMetrics, which just became a CNCF sandbox project, converge, with a single client library providing both formats. Over time, Skillington says, “we would like to transition all of our proprietary services and any of the remaining open source software that we don’t already monitor with Prometheus/OpenMetrics to use it.”

“The open governance and wide industry participation helped us feel at ease that Prometheus would be compatible with almost any popular open source software that we would need to monitor now and in the future.”

— ROB SKILLINGTON, TECHNICAL LEAD FOR METRICS AND SYSTEMS MONITORING AT UBER

To that end, Skillington says increased integration with Prometheus is a priority, “both in terms of providing observability for any application that exports Prometheus metrics and for systems monitoring using node_exporter or other third-party Prometheus metrics exporters.” His team is also making sure that any environment that’s run outside of the vanilla Uber offering will expose Prometheus metrics and have a standard Prometheus setup. Additionally, “We’re looking to make it easier for teams that have no experience at all running Prometheus or M3 to run their own,” says Skillington. “This type of software doesn’t need to be complicated to operate.”

For other organizations starting down this monitoring path, Skillington has some simple advice: “Don’t solve problems that have already been solved,” he says. “Most people evaluate open source metrics infrastructure in a way that’s completely end-to-end. That is not really the case anymore. There’s a lot of interoperability between systems out there today, and it’s best to really solve the piece that’s unique to your platform and setup.”

That was Uber’s mission with M3, and now the team is happy to share it with others. “Like others have said, we’re not really in the business of writing or making money off metric systems, so we’d love for the community to take our M3 platform and use it. Hopefully it helps the road map as well.”