Solving Network Stability Issues with Cilium
Challenge
Sicredi is Brazil’s first credit union, with more than 120 years of experience in the business and over 7.5 million members. Under the Sicredi operating model, which spans more than 300 products and services, the funds raised are reinvested in the region. In this way, Sicredi positively impacts the community, stimulating income generation and sustainable growth.
The Sicredi platform team initially built their Kubernetes platform using Flannel and then Weave as their CNI. However, they soon encountered networking and infrastructure issues that caused significant problems in production, including network outages. Recognizing these issues, they began searching for a replacement.
Solution
After an extensive evaluation, the Sicredi platform team chose Cilium to address the performance and networking issues they had encountered. They wanted a performant, stable networking solution, and Cilium’s use of eBPF fit that goal. They were evaluating service mesh solutions at the same time and liked that Cilium has a service mesh built in.
Impact
With Cilium, Sicredi has a stable and reliable network to build their platform on. With Cluster Mesh, they can also run the same applications across multiple data centers and their cloud provider. Together, network stability and Cluster Mesh have made it easier to deploy a growing number of microservices, which in turn lets them create new business capabilities and serve their customers better.
By the numbers
500+ Kubernetes nodes
4000+ apps on the infrastructure
13000+ instant payment transactions/minute
Solving Network Stability Issues with Cilium
Sicredi has a dedicated platform engineering team of five individuals tasked with overseeing their Kubernetes platform. This platform operates across both OpenStack servers in three on-premise data centers and EKS on AWS.
The Sicredi platform engineering team embarked on their Kubernetes journey in 2018, originally choosing Flannel as their CNI (Container Network Interface). Not long after implementation, however, they faced significant challenges, including network outages in production. In search of a fix, they moved to Weave. Unfortunately, the Weave project stopped shipping new versions, and the outages in production continued.
“We started using Kubernetes in 2018 and used Flannel as a CNI. However, we faced some networking issues in our infrastructure and tried to move to Weave. Unfortunately, the project stopped shipping new versions and we were still facing network outages in production. With these issues, we started to look for a stable new solution to work as our CNI.”
Matheus Morais, IT Infrastructure Analyst, Sicredi
With Weave no longer maintained, the Sicredi team began searching for a new CNI. They were interested in a solution that utilized eBPF, believing it could help address some of their resource exhaustion issues. At the same time, they were also looking for a high-performance service mesh solution.
“We were looking for improved networking performance because we had numerous scenarios where we were using too many resources – a lot of CPU and memory. We believed that using eBPF might alleviate that and other things that were impacting our network. We also saw eBPF’s potential in other areas too, like observability and security.”
Matheus Morais, IT Infrastructure Analyst, Sicredi
“At that time, service mesh was also a hot topic. As we were in search of a replacement for our CNI, discussions about service mesh were already underway within our company.”
Vinicius Castro, Platform Engineer, Sicredi
After choosing Cilium, Sicredi removed Weave completely from their clusters, installed Cilium with Helm and migrated all their clusters to Cilium. They haven’t had network stability problems since.
“We removed Weave completely and used Helm to install Cilium. We drained each node of our clusters and made a complete reboot of all interfaces from Weave to Cilium. All of our customers are based in one timezone so we could do the change overnight.
Since installing Cilium, we’ve had a very stable network for our platform. This is the key point, we haven’t had any kind of problems related to Cilium besides minor bugs as we moved from version to version and that points to the stability that Cilium gives us.”
Matheus Morais, IT Infrastructure Analyst, Sicredi
From Cluster Mesh to Service Mesh
During their search for a new CNI, Sicredi was also in the market for a service mesh solution. They decided to adopt Cilium, as its Cluster Mesh feature was available and met their requirements for connecting applications across clusters.
“Cilium Cluster Mesh gave us possibilities that we didn’t have before. We can run the same application across our data centers and AWS. It provides us with a consistent networking experience wherever we need to go. Applications in different clusters can communicate with each other without needing to go through an ingress controller. We are expanding this capability for more applications because this value is already proven. Cluster Mesh works, it’s fast, and it’s reliable.”
Matheus Morais, IT Infrastructure Analyst, Sicredi
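One common way to get this kind of cross-cluster communication with Cluster Mesh is a global service: a Service with the same name and namespace is created in each connected cluster and annotated so that Cilium load-balances across backends in all of them. The sketch below is illustrative only and is not taken from Sicredi’s configuration. The service name `payments-api`, the namespace `payments`, the `app` selector label, and the port numbers are all hypothetical. It builds the manifest in Python and prints it as JSON, which `kubectl apply -f -` accepts.

```python
import json


def build_global_service(name, namespace, port, target_port):
    """Build a Kubernetes Service manifest marked as a Cilium global service.

    All argument values used below are hypothetical examples.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Annotation Cilium Cluster Mesh uses to treat same-named
            # services in connected clusters as one global service.
            "annotations": {"service.cilium.io/global": "true"},
        },
        "spec": {
            "type": "ClusterIP",
            # Assumes the backing pods carry an `app=<name>` label.
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


service = build_global_service("payments-api", "payments", 80, 8080)
print(json.dumps(service, indent=2))
```

The same manifest would need to be applied in every cluster that should serve or consume the service, since Cluster Mesh matches global services by name and namespace.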
Once Cluster Mesh was up and running, they also looked towards a service mesh to secure interservice communication and once again landed on Cilium.
“We started using Cilium in our EKS clusters and encountered security challenges. So many microservices were running without any kind of authentication or encryption. This prompted us to start doing a service mesh proof of concept.
Although we already had Cilium available, we also evaluated Consul Connect, Kong Mesh, and Istio. Cilium stood out as it did not require sidecars, appealing to us not only for resource savings but also to avoid potential load issues caused by additional software. Cilium’s sidecar-less approach did call our attention, leading us to adopt it to provide security across all our microservices.”
Matheus Morais, IT Infrastructure Analyst, Sicredi
The Sicredi team now uses Cilium Service Mesh to secure communication between services.
“We leverage the service discovery feature already embedded within the service mesh to protect our services. Rather than controlling everything through ingress, Service Mesh allows us to use service discovery across clusters and use the Kubernetes names to enforce Cilium Network Policy deny or allow on communication flows. It is particularly beneficial for our developers because it is transparent to them.”
Matheus Morais, IT Infrastructure Analyst, Sicredi
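To make the idea of allowing or denying flows by Kubernetes identity more concrete, here is a minimal sketch of a CiliumNetworkPolicy that admits traffic to one workload only from another workload, optionally pinned to a specific cluster in the mesh. It is an assumption-laden illustration, not Sicredi’s actual policy. The names `ledger`, `payments-api`, `payments`, and `aws-eks-1` are hypothetical, and the sketch assumes workloads are selected by an `app` label. The manifest is built in Python and printed as JSON, which `kubectl apply -f -` accepts.

```python
import json


def build_allow_policy(name, namespace, target_labels, source_labels,
                       source_cluster=None):
    """Build a CiliumNetworkPolicy allowing ingress to the target pods
    only from pods matching the source labels.

    All names passed in below are hypothetical examples.
    """
    source_selector = dict(source_labels)
    if source_cluster is not None:
        # With Cluster Mesh, endpoints carry their cluster name under
        # this label, so a policy can match on the originating cluster.
        source_selector["io.cilium.k8s.policy.cluster"] = source_cluster
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods the policy applies to.
            "endpointSelector": {"matchLabels": dict(target_labels)},
            # Once an ingress rule selects a pod, ingress traffic that
            # matches no rule is dropped.
            "ingress": [
                {"fromEndpoints": [{"matchLabels": source_selector}]}
            ],
        },
    }


policy = build_allow_policy(
    name="allow-payments-to-ledger",
    namespace="payments",
    target_labels={"app": "ledger"},
    source_labels={"app": "payments-api"},
    source_cluster="aws-eks-1",
)
print(json.dumps(policy, indent=2))
```

Because the policy matches on labels rather than IP addresses, application teams do not need to change anything in their services for it to take effect, which fits the transparency the team describes.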
The Sicredi team also utilizes the Cilium ingress controller for their gRPC applications and is looking to expand their usage with Gateway API.
“We were having problems with the network load balancer and nginx ingress controller at AWS and we found a solution for our problems using the Cilium Ingress controller, primarily for gRPC applications. We are also looking forward to Gateway API improvements to further improve our ingress capability.”
Vinicius Castro, Platform Engineer, Sicredi
Unlocking New Business Capabilities with Cilium
Sicredi’s adoption of Cilium addressed their network stability and multi-cloud challenges. This move also enabled the creation of new capabilities for their business.
“Cilium is a critical piece of software that gives us the ability to have more and more microservices, enabling new business capabilities. We don’t have any restrictions on creating new services running inside Kubernetes clusters because Cilium provides us with a great network that is capable of handling scale.
Sicredi also has a multi-cloud strategy and Cilium Cluster Mesh is crucial for delivering this to the business.”
Matheus Morais, IT Infrastructure Analyst, Sicredi
Future Plans for Enhancing Connectivity and Security with Cilium
With Cilium as a key part of their platform, the Sicredi team plans to expand its use. They intend to test Gateway API and Mutual Authentication, look into replacing their nginx ingress with the Cilium Ingress controller, and try out Hubble for observability, starting in their development environments.
“We would like to test Cilium’s Gateway API implementation and we are looking into the possibility of migrating our nginx Ingress to Cilium Ingress. We’ve also been having conversations with our monitoring team and they want to upgrade from our current observability tool, Dynatrace. We have started enabling Hubble in our development environments to make some tests and see if Hubble can help with observability. Cilium’s mTLS-based Mutual Authentication is also something we were considering for the future.”
Matheus Morais, IT Infrastructure Analyst, Sicredi