Building a Resilient Payments Platform with Cilium
Challenge
Form3 is a global financial technology company that offers a powerful, managed payment technology platform. Their platform is designed to make it easier for banks to integrate into payment schemes by using a single API while still meeting all compliance requirements. This API connects to various payment schemes, reducing the time to market, effort, and maintenance banks need to move money.
When Form3 was developing its Faster Payment Service (FPS) gateway, it had a key requirement: the ability to switch between data centers with zero downtime. To meet it, Form3 needed to connect its Kubernetes clusters in a way that allowed for a unified database, a unified message broker, and global service availability. This setup would let them keep the same database and message flow even when switching between data centers.
Solution
Form3 chose Cilium as their networking, security, and observability solution. They made this decision after evaluating the options on the market and finding that Cilium was the only one that could easily connect their on-premises clusters, thanks to its Cluster Mesh feature. Cilium also allowed them to control access to their pods and lock down their environments.
Later, as they were developing their multi-cloud payment platform, Form3 relied on Cilium to run active-active-active across AWS, Google Cloud, and Azure in real time.
Impact
Integrating Cilium into their platform allowed Form3 to meet their FPS gateway business requirement of switching between data centers with zero downtime. The use of Cilium network policies enhanced the control over pods and improved the security of their workloads. Using Cilium across different clouds also simplified the process and learning curve for their multi-cloud payment platform because they didn’t have to implement a new CNI for each cloud. Ultimately, Cilium played a crucial role in helping Form3 meet compliance requirements and achieve their objective of zero downtime with maximum resiliency.
By the numbers
Billions
of packets moved securely
Hundreds
of workloads secured using network policies
Three
cloud vendors
Form3 is a financial technology company based in the UK which offers a single API that allows banks to easily access multiple payment schemes to move customers’ funds between banks.
Their Faster Payment Service (FPS) gateway platform is built on a microservice architecture running across three Kubernetes clusters for high availability. Across these clusters, there are hundreds of pods running and the setup is replicated across development, staging, and production environments.
As a requirement for operating an FPS Gateway service in the UK, Form3 needs to have their workloads running in two physical data centers in London. Each data center needs to be capable of handling all the traffic if the other data center fails. To solve this problem, Form3 chose a setup where they run a Kubernetes cluster in each data center and a third cluster in the cloud. The clusters are linked together using Cilium’s Cluster Mesh technology, which allows Form3 to run a single logical database and message broker across all three clusters. This allows the FPS gateway to carry on processing payments seamlessly if any of the data centers (and therefore clusters) is lost.
“Our only other option would have been to run a single Kubernetes cluster across all the data centers. But that design would be hard to run with a control plane highly available across locations. It would also give us less resilience because if something goes wrong with the cluster, we could lose the whole system. Having separate clusters that are independent of each other allows them to fail independently and it doesn’t affect the whole system,” said Kevin Holditch, VP of Engineering: Platform at Form3.
“The value Cilium added to our FPS platform was massively simplifying our problem. If Cilium did not exist, it would have been much tougher to solve that requirement of being able to switch off one data center and have everything carry on running. There wasn’t any other option on the market that solved the problem in the same way that Cilium did. Cluster Mesh became key to our design for resiliency and underpins how we run our entire infrastructure.”
Kevin Holditch, VP of Engineering: Platform
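To make the Cluster Mesh idea above more concrete, here is a minimal sketch of how a service can be shared across meshed clusters. It is not Form3's actual configuration. It assumes Cilium's documented global service annotations (`service.cilium.io/global` and `service.cilium.io/shared`) and builds a Service manifest as a Python dictionary, printed as JSON so it could be applied with `kubectl apply -f`. The service name `message-broker`, the namespace `payments`, and port 4222 are hypothetical placeholders.

```python
import json


def global_service(name: str, namespace: str, port: int, shared: bool = True) -> dict:
    """Build a Kubernetes Service manifest marked as a Cilium global service.

    For Cluster Mesh to merge backends, a Service with the same name and
    namespace has to exist in every cluster that should take part.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                # Load-balance to backends in all meshed clusters.
                "service.cilium.io/global": "true",
                # Whether this cluster's backends are offered to the others.
                "service.cilium.io/shared": "true" if shared else "false",
            },
        },
        "spec": {
            "type": "ClusterIP",
            # Hypothetical label convention: pods carry app=<service name>.
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
        },
    }


# Hypothetical example values, not taken from Form3's setup.
manifest = global_service("message-broker", "payments", 4222)
print(json.dumps(manifest, indent=2))
```

With a manifest like this applied in each cluster, clients keep resolving the same service name, and traffic can continue to reach backends in the surviving clusters if one cluster is lost.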
Once they had Cilium set up, Form3 also started to use Hubble for observability and debugging.
“Our main use case for Hubble is debugging. Occasionally in our FPS platform, we must do maintenance where we rebuild the clusters every five weeks to patch the machines. At times when we try to bring those clusters back, we can have network problems. Hubble is a great tool because it enables us to see and identify where the problem is,” said Holditch.
Unlocking New Services with Cilium
Following the success of Cilium in their FPS Gateway Kubernetes clusters, Form3 began developing a multi-cloud platform. This decision was primarily driven by the demands of larger banks being onboarded onto their platform. Government regulators required these banks to provide an exit strategy to ensure they were not tied to any specific cloud vendor because they didn’t want an outage to affect the UK economy. Consequently, these banks passed this requirement on to Form3.
To solve this problem, Form3 chose to copy the successful design of the FPS Gateway using three Kubernetes clusters. However, in this instance, they ran a Kubernetes cluster in each of the three major cloud vendors: AWS, Azure, and GCP. Form3 privately networked the clouds together to allow pods in one cluster to talk to pods in another cluster. Because of the success Form3 had using Cilium as their CNI in the FPS Gateway, they chose to use Cilium as their CNI in their multi-cloud platform.
This gave them a single CNI across every Kubernetes cluster at Form3, making support much easier. It also enabled them to leverage Cilium’s security features, such as network policies, to tightly control the network traffic of every pod in each cluster.
“When we started building our multi-cloud platform using Kubernetes we chose Cilium as our CNI so we could run multi-cloud with the same setup. We wanted a consistent CNI across all cloud providers to avoid needing to learn a different one for each environment. Additionally, we appreciated the enhanced security features Cilium offered, allowing us to strictly control what pods can and cannot do, a level of security not available with native Kubernetes network policies,” said Holditch.
With their basic infrastructure set up, they next focused on security and observability of the platform. “We use Cilium network policies to tightly lock down our workloads because we’re in a payment environment. It gives you good isolation of pods and can tightly lock down pod DNS lookups which is another business requirement for us. Cilium adds a lot of security to our multi-cloud estate,” said Holditch.
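As an illustration of the kind of lockdown described above, the following sketch builds a `CiliumNetworkPolicy` that restricts which DNS names a set of pods may look up and then connect to. It is an assumption-laden example rather than Form3's policy: the policy name, namespace, `app` label, and the `*.example.com` pattern are hypothetical, and the kube-dns selector assumes the cluster DNS pods run in `kube-system` with the label `k8s-app=kube-dns`. The field names (`endpointSelector`, `toEndpoints`, `toPorts`, `rules.dns`, `toFQDNs`) follow Cilium's documented policy schema.

```python
import json


def dns_locked_policy(name: str, namespace: str, app_label: str,
                      allowed_patterns: list[str]) -> dict:
    """Build a CiliumNetworkPolicy limiting DNS lookups and egress by name."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy applies to (hypothetical label convention).
            "endpointSelector": {"matchLabels": {"app": app_label}},
            "egress": [
                {
                    # Allow DNS queries only to the cluster DNS pods, and
                    # only for names that match the allowed patterns.
                    "toEndpoints": [{
                        "matchLabels": {
                            "k8s:io.kubernetes.pod.namespace": "kube-system",
                            "k8s:k8s-app": "kube-dns",
                        }
                    }],
                    "toPorts": [{
                        "ports": [{"port": "53", "protocol": "ANY"}],
                        "rules": {
                            "dns": [{"matchPattern": p} for p in allowed_patterns]
                        },
                    }],
                },
                {
                    # Allow HTTPS only to the IPs those names resolved to.
                    "toFQDNs": [{"matchPattern": p} for p in allowed_patterns],
                    "toPorts": [{"ports": [{"port": "443", "protocol": "TCP"}]}],
                },
            ],
        },
    }


# Hypothetical example values.
policy = dns_locked_policy("lock-down-gateway", "payments", "gateway",
                           ["*.example.com"])
print(json.dumps(policy, indent=2))
```

Because the policy contains egress rules, the selected pods are denied any egress traffic that is not explicitly allowed, so lookups for other domains and connections to other destinations are dropped.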
“Hubble is also important because when you are moving packets across clouds, you need visibility on how far packets are getting, what’s happening, or whether they’re being blocked by some kind of network policy. Hubble’s observability is a massive value add that just comes with the CNI.”
Now, they operate Kubernetes clusters with three major cloud providers: AWS, GCP, and Azure. Cilium is used as the uniform CNI across these clouds, enabling them to use Hubble for debugging and Cilium network policies for securing their workloads.
“As far as I know, there are not too many companies running a true multi-cloud setup with workloads spanning multiple cloud vendors in real time, with an active-active setup. Cilium has allowed us to be quite groundbreaking in that regard.”
Kevin Holditch, VP of Engineering: Platform
Meeting Business and Regulatory Requirements
Cilium has been a significant success for the Form3 team. It has enabled them to meet the regulatory requirements necessary for their FPS business operations and has also opened the door for their new multi-cloud payments service.
“The main value add of Cilium is the network policies, being able to use Hubble for debugging, which is useful when things don’t work, and having a common CNI everywhere to reduce the cognitive burden on your engineering team.”
Kevin Holditch, VP of Engineering: Platform