Grafana Labs, the company behind the popular open source Grafana project, offers customers a hosted metrics platform called Grafana Cloud. The platform incorporates Metrictank, a Graphite-compatible metrics service, and Cortex, the CNCF sandbox project for multitenant, horizontally scalable Prometheus-as-a-Service.
Grafana Labs engineers run Metrictank and Cortex to troubleshoot their own technical issues. But as the company started adding scale (Cortex and Metrictank each process tens of thousands of requests per second), query performance issues became noticeable. That latency negatively impacted Grafana Cloud customers’ user experience.
Without a way to visualize the path of requests end-to-end, the team attempted to solve the problem by guessing the cause of the slowness and rolling out a “fix.” That meant “many times shooting in the dark, only to have our assumptions invalidated after a lot of experimentation,” says Software Engineer Goutham Veeramachaneni.
The Metrictank team had already been using Jaeger distributed tracing to understand requests better and to see all logs in one place. With that experience using Jaeger, “we doubled down on it with Cortex to improve the query performance,” says VP of Product Tom Wilkie. Jaeger allowed the team to drill down to specific requests and quickly find the queries that were causing latency. The results were stellar: query performance improved by as much as 10x.
As it turned out, Jaeger has also helped the Grafana Labs team with bug-hunting. “It’s easier to visualize where the problems are, and it just made me more confident at tackling things because I’m able to see exactly what’s going wrong,” says Veeramachaneni. With Jaeger in place, he adds, “the confidence in operating our system grew by an order of magnitude.”
Read more about Grafana Labs’ use of Jaeger in the full case study.