Meltwater's Live Migration to Cilium for Richer Features
Challenge
Meltwater is a global media intelligence company that specializes in providing businesses with real-time insights and analytics. The company offers a comprehensive suite of software tools and services to help organizations monitor and analyze online news, social media, and other digital sources.
Meltwater initially built their Kubernetes platform using the AWS VPC CNI for the networking layer but quickly ran into issues: missing features such as network policies and encryption, AWS API rate limiting, performance problems with kube-proxy/iptables, and limited observability within their clusters. In their search for a replacement, they wanted a tool that was part of the CNCF and had strong community support, active and open development, widespread adoption, and a solid technical implementation.
Solution
Meltwater turned to Cilium as their preferred networking and observability solution to take advantage of its performance, rich feature set, and maturity in the cloud native ecosystem. Cilium also allowed them to completely replace their existing AWS VPC CNI without interrupting service to their customers.
Impact
Cilium is now the default CNI across all of Meltwater’s Kubernetes clusters. As a result, Meltwater has gained improved performance, a better architecture (in particular, fewer API calls to AWS), access to a comprehensive range of features backed by an ecosystem of tooling, and improved network observability with Hubble.
By the numbers
Massive scale
2+ petabytes of data, 200+ billion documents
Multi-cloud
Hundreds of Kubernetes nodes, 10k+ pods across multiple clusters
Time-to-value
23k+ customers running 200M+ searches per week
Live Migration to Cilium for Richer Features
Meltwater has about 350 engineers spread around the globe in 40+ fully autonomous teams working on their suite of software tools and services to help organizations monitor and analyze online news, social media, and other digital sources. With a presence in over 55 countries, Meltwater serves thousands of clients across various industries, including marketing, public relations, and corporate communications. They rely entirely on AWS for their infrastructure and have a dedicated team responsible for delivering an internal developer platform known as the Foundation Mission.
The Foundation Mission includes a team of 9 people managing a set of multi-tenant Kubernetes clusters consisting of several hundred nodes (6,500 cores, 16 TB of memory) that accommodate 10,000+ pods, 250 namespaces, and 3,000 deployments. These clusters run both stateless and stateful workloads and serve as the primary deployment platform for the entire organization. As the platform team, their goal is to make it easy for development teams to deploy, observe, and monitor their applications.
Because Meltwater’s entire infrastructure runs within AWS, they originally opted for the default AWS VPC CNI for cluster networking. However, they quickly ran into limitations, including AWS API rate limiting, excessive IP address consumption, performance issues, a rudimentary feature set, and restricted visibility into their network. Motivated by these challenges, they started searching for an alternative solution.
Meltwater conducted evaluations of various popular CNI plugins with their key criteria being a rich feature set that would set them up for the future, maturity of the solution, wide adoption and community support, and a solid technical implementation. Cilium emerged as their preferred choice due to its extensive features and widespread adoption within the cloud native ecosystem.
“We evaluated the feature set of every CNI and Cilium got our vote. With it coming into the CNCF and every major cloud provider adopting it too, it looked like we were not the only ones appreciating Cilium’s technical superiority. All of this plus the great pace of development of the project pointed to Cilium being a great solution that we don’t have to change in a year or two because changing your CNI is not fun.”
Simone Sciarrati, Principal Engineer, Meltwater
After making their decision, Meltwater replaced the AWS VPC CNI with Cilium by doing a live migration. They used labels and node selectors so that the two CNIs could live side by side in the same cluster until all nodes had been migrated.
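The label and node selector approach can be sketched roughly as follows. This is an illustrative model only, not Meltwater's actual tooling: the label key `example.com/cni`, its values, and the helper names are all hypothetical, and the functions only build data structures rather than talking to a cluster. The idea is that each CNI's DaemonSet is pinned to nodes carrying a matching label, and nodes are relabeled in small batches until none remain on the old CNI.

```python
# Hypothetical label key and values; the source does not say which ones were used.
CNI_LABEL = "example.com/cni"
OLD_CNI = "aws-vpc-cni"
NEW_CNI = "cilium"


def node_selector_patch(cni_value: str) -> dict:
    """Build a patch body that pins a CNI DaemonSet to nodes with the given label value."""
    return {
        "spec": {
            "template": {
                "spec": {"nodeSelector": {CNI_LABEL: cni_value}}
            }
        }
    }


def plan_migration(nodes: dict[str, str], batch_size: int) -> list[list[str]]:
    """Group the nodes still labeled for the old CNI into batches, in name order."""
    pending = sorted(name for name, cni in nodes.items() if cni == OLD_CNI)
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]


def relabel(nodes: dict[str, str], batch: list[str]) -> dict[str, str]:
    """Return a copy of the node-to-label mapping with the batch moved to the new CNI."""
    updated = dict(nodes)
    for name in batch:
        updated[name] = NEW_CNI
    return updated


def migration_complete(nodes: dict[str, str]) -> bool:
    """True once every node carries the new CNI label value."""
    return all(cni == NEW_CNI for cni in nodes.values())


if __name__ == "__main__":
    # Made-up inventory: three nodes on the old CNI, one already migrated.
    inventory = {
        "node-a": OLD_CNI,
        "node-b": OLD_CNI,
        "node-c": OLD_CNI,
        "node-d": NEW_CNI,
    }
    for batch in plan_migration(inventory, batch_size=2):
        inventory = relabel(inventory, batch)
        print("migrated batch:", batch)
    print("done:", migration_complete(inventory))
```

In a real cluster, the patches would be applied to each CNI's DaemonSet and the relabeling done through the Kubernetes API. Workloads on a relabeled node would presumably also need to be rescheduled or the node replaced so that pods pick up networking from the new CNI, a step this sketch leaves out.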
Implementing Cilium allowed Meltwater to replace kube-proxy to improve performance, reduce calls to the AWS API, and address their IP address management problem.
“We had to rethink how we managed IP addresses in our network space because even though we had thousands of IPs, we were using them inefficiently. It would have been like network surgery to add more CIDR blocks to what we had. By switching to Cilium, we were able to reduce the initial IP allocation and get out of a very tricky situation,” said Sciarrati.
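To make the IP allocation point concrete, here is a back-of-the-envelope sketch of how per-node pre-allocation eats into a fixed address range. Every number in it is made up for illustration: the source gives no CIDR sizes, node counts, or allocation settings, and the function name is hypothetical. It ignores addresses reserved by the cloud provider and addresses used by anything other than pre-allocation.

```python
import ipaddress


def free_after_preallocation(cidr: str, node_count: int, preallocated_per_node: int) -> int:
    """Addresses left in the range after every node reserves its initial block of IPs."""
    total = ipaddress.ip_network(cidr).num_addresses
    return total - node_count * preallocated_per_node


if __name__ == "__main__":
    # Illustrative values only: a /20 range (4096 addresses) shared by 100 nodes.
    cidr = "10.0.0.0/20"
    nodes = 100
    # A large initial allocation per node leaves little headroom...
    print(free_after_preallocation(cidr, nodes, 30))  # 1096
    # ...while a smaller initial allocation frees most of the range.
    print(free_after_preallocation(cidr, nodes, 8))   # 3296
```

The same range supports far more growth when each node reserves fewer addresses up front, which is the kind of relief the quote describes without having to add CIDR blocks.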
Better Observability with Hubble
Once the platform team had Cilium set up, they also enabled Hubble and Prometheus metrics for better observability. This wasn’t just about overcoming their initial visibility struggles with the AWS VPC CNI; it was about embracing a new level of network management and observability.
“With Hubble and Prometheus metrics, we now know better what’s going on in the network and can debug issues much easier. For our development teams, just assigning IPs is of limited use. Providing visibility into network traffic and allowing them to investigate issues enables the CNI to provide actual value to our users – not just the platform team,” said Sciarrati.
Enabling Hubble gave the team at Meltwater deeper visibility into their network and equipped them to identify, debug, and resolve issues quickly.
“Besides just the UI, Hubble is an easier way to debug network issues and see network traffic. I don’t need to use tcpdump anymore. We understand better what is going on between all the different components and workloads in our system. Development teams can now understand why their applications aren’t working. With Hubble UI, it’s so much easier for us to go to a UI and visualize what’s going on, troubleshoot, and find out what’s going wrong quickly. Troubleshooting is more efficient, faster, and easier.”
Federico Hernandez, Principal Engineer, Meltwater
Looking ahead, the platform team at Meltwater has ambitious plans for Hubble. Their vision is to make the tool available to every team within the organization.
Creating a Comprehensive Networking and Observability Platform with Cilium
Cilium has been a significant success for the platform team at Meltwater, covering both their networking and observability needs in a single solution.
“Cilium not only makes our situation a lot more stable, but more importantly it opens doors to many features. By removing things like kube-proxy and nginx, Cilium helped us centralize and move functionality within the CNI without having to run anything extra. Removing these results in fewer pieces to manage and reduces our costs. For the future, we are looking at Cluster Mesh with blue-green deployments to get to the point where we can swap entire clusters under workloads without users noticing. Cilium has already solved a lot of issues, but what it opens up is even better.”
Simone Sciarrati, Principal Engineer, Meltwater