Operating multiple high-density bare-metal clusters in a highly regulated industry
Challenge
Finleap Connect operates in a highly regulated environment. In 2018, they had five months to implement mutual TLS (mTLS) across all services in their clusters, independent of their business code, to comply with the new European PSD2 payment directive.
Solution
Finleap Connect used Linkerd to mTLS all services at scale while minimizing the impact on developer productivity.
Impact
“The cloud native transformation was a huge undertaking — especially considering our tight deadline of five months! To our surprise, the mTLS aspect was fairly easy. Linkerd was installed in an hour and running in production within a week without impacting our developer team,” said Christian Hüning, Director of Cloud Technologies & Switchkit at Finleap Connect.
By the numbers
Fast rollout
Services mTLSed within 1 hour. In production within 1 week.
Scale
50+ microservices spread across 10 Kubernetes clusters (the largest with 52 nodes and 5,000 pods), distributed in 3 regions across 3 cloud providers
Canary deployment frequency
Easily deploying to production 20-50 times a week.
mTLSing services with Linkerd at scale without impacting developer productivity
Finleap Connect operates multiple high-density bare-metal Kubernetes clusters with up to 5,000 pods — keeping their customers’ highly sensitive financial data safe is business-critical. Needless to say, security is paramount.
Connect’s cloud team migrated its entire platform to a cloud native architecture while mTLSing all services to comply with strict regulatory requirements. “It was a huge undertaking, especially considering our tight deadline of five months! To our surprise, the mTLS aspect was fairly easy. Linkerd was installed in an hour and running in production within a week without impacting our developer team,” said Christian Hüning, Director of Cloud Technologies.
This five-month deadline was driven by two things. First, there was the European PSD2 payment directive, a new EU law requiring payment services providers to improve customer authentication processes. Second, Connect’s legacy system was hard to maintain. Every night, something seemed to break, changes were hard, and failovers were mostly manual. That’s why Hüning’s team decided to migrate all customers to a new cloud native infrastructure.
Enabling next-gen financial services
Finleap Connect is a leading independent European open banking platform. Their full-stack platform enables organizations across banking, accounting, and lending to provide next-gen, mobile-first financial services to their customers. Services include data and analytics enrichment, default financial data accessibility, seamless payments across a range of applications, and much more.
The Connect team understands how customers transact and interact, and that know-how is embedded into their platform. Additionally, the platform allows their clients to compliantly access their customers’ financial transactions and enrich that data with analytics tools, all while providing high-quality digital banking products and services.
The engineering team
The engineering team includes a handful of cloud engineers and around 60 developers spread across multiple smaller teams. The cloud team is responsible for 50+ microservices spread across ten Kubernetes clusters, distributed in three geographic regions across GCP, AWS, and a bare-metal private cloud. Their largest cluster runs 52 nodes with 5,000 pods.
To operate Connect Cloud, their cloud-agnostic private setup, Finleap Connect uses SAP Gardener. Linkerd is deployed across all clusters and represents an integral part of their infrastructure. Linkerd, including its metrics, is centrally managed through Buoyant Cloud.
mTLS across all services, a regulatory requirement
Connect operates in a highly regulated environment and handles highly sensitive financial data, so important considerations have to be taken into account. In 2018, that meant implementing mTLS across all services in their clusters, independent of the actual business code (i.e., solving it at a different layer).
To address that challenge, they evaluated a variety of available solutions. One of the options was Istio. They installed it on their test cluster and, while it worked fine, it required a fair amount of configuration. When they realized they’d need a configuration for each service, Istio quickly became less feasible. This was back in 2018, in the early days, when Connect was migrating their entire stack to a cloud native architecture on a very ambitious roadmap: the migration had to be finalized within five months. Their development teams were already dealing with a good number of transformation tasks, so they concluded that Istio would be an additional configuration burden and decided against it.
The other service mesh they looked at was Linkerd (back then known as Conduit). Its approach to simplicity, roadmap, and the various Slack discussions they had with the project maintainers gave them the confidence they needed to move forward with this project.
Linkerd: installed within an hour, in prod within a week
The Linkerd installation took less than an hour; the overall production setup took about a week. Some updates, including contributing to Linkerd’s certificate management feature, took a few additional weeks (more on that in a minute).
Overall, Connect really liked the entire experience with Linkerd. While they ran into a bit of trouble when starting to run Linkerd at scale, it mostly boiled down to required scale-up and -out of Linkerd components.
There were a few things that were on the Linkerd roadmap but not yet implemented. Certificate management was one of them. Certificates expire after a year, which gave Connect one year to address the gap. They decided to contribute to the Linkerd project to help develop that feature. Today, certificate rotation is fully automated. Applications using server-speaks-first protocols were another example; those have been supported since Linkerd 2.10.
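To make the expiry problem concrete, here is a minimal Python sketch of the kind of check an automated rotation job might perform. It is not how Linkerd or cert-manager implements rotation; the function name `needs_rotation`, the 30-day renewal window, and the example dates are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical renewal window: rotate once fewer than 30 days remain.
RENEWAL_WINDOW = timedelta(days=30)


def needs_rotation(not_after: datetime, now: datetime,
                   window: timedelta = RENEWAL_WINDOW) -> bool:
    """Return True when the certificate expires within the renewal window."""
    return not_after - now <= window


# Arbitrary example issue date; the one-year validity mirrors the text above.
issued = datetime(2019, 1, 1, tzinfo=timezone.utc)
not_after = issued + timedelta(days=365)

print(needs_rotation(not_after, issued + timedelta(days=100)))  # well before expiry
print(needs_rotation(not_after, issued + timedelta(days=350)))  # inside the window
```

Running a check like this on a schedule, and reissuing whenever it returns true, is what removes the yearly manual deadline.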
End-to-end encryption with minimal impact on developer productivity
Connect was able to implement mTLS across all services at scale while minimizing the impact on developer productivity. The entire process was fairly quick and allowed them to meet their initial critical deadline to go live with their new platform. Without Linkerd, they wouldn’t have been able to achieve that.
Additionally, Linkerd’s four golden signal metrics are particularly useful for uniform and generic platform-level debugging and service health observability. These metrics provide them with immediate insights when migrating workloads to Kubernetes. The cloud team gets all these insights without having to dig too deep into application specifics, which is a big time-saver for the team and a great way to get started with cloud native application management for new developments.
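As a rough illustration of what such platform-level metrics capture, the following Python sketch derives three commonly cited signals (success rate, request rate, and a latency percentile) from raw request records. The `Request` record shape and the function names are assumptions made for this example; it does not reproduce the metrics pipeline that Linkerd itself uses.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    latency_ms: float
    ok: bool


# All helpers assume a non-empty list of requests.
def success_rate(requests):
    """Fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def request_rate(requests, window_seconds):
    """Requests per second over the observation window."""
    return len(requests) / window_seconds


def latency_percentile(requests, pct):
    """Nearest-rank latency percentile in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


sample = [Request(12.0, True), Request(15.0, True),
          Request(240.0, False), Request(18.0, True)]
print(success_rate(sample))            # 0.75
print(request_rate(sample, 2))         # 2.0
print(latency_percentile(sample, 50))  # 15.0
```

Because these values depend only on observed traffic, they can be collected the same way for every service, which is why they suit generic debugging without application knowledge.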
They have also implemented canary deployments through Linkerd and Flagger and can now deliver features faster and with a lot more confidence.
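The idea behind a canary rollout can be sketched in a few lines of Python: traffic is shifted to the new version in small steps, and the rollout is aborted as soon as the observed success rate drops below a threshold. This is a simplified model, not the Flagger API; the function `run_canary` and its parameters `step`, `max_weight`, and `threshold` are hypothetical names used only for this example.

```python
def run_canary(success_rates, step=10, max_weight=50, threshold=0.99):
    """Shift traffic to the canary step by step; roll back on a failed check.

    success_rates: canary success rate observed at each analysis interval.
    Returns (status, canary_weight_percent), where status is one of
    "promoted", "rolled_back", or "in_progress".
    """
    weight = 0
    for rate in success_rates:
        if rate < threshold:
            return "rolled_back", 0      # send all traffic back to the stable version
        weight = min(weight + step, max_weight)
        if weight >= max_weight:
            return "promoted", 100       # canary passed every check
    return "in_progress", weight


print(run_canary([1.0, 1.0, 1.0, 1.0, 1.0]))  # ('promoted', 100)
print(run_canary([1.0, 0.95]))                # ('rolled_back', 0)
print(run_canary([1.0, 1.0]))                 # ('in_progress', 20)
```

The success rate fed into each step is exactly the kind of metric the mesh already reports, which is why the two tools combine naturally.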
“All this was almost automatically enabled by deploying and activating Linkerd across our applications. Linkerd helped us avoid more complex TLS setups for certain services, saving my team lots of backlog time,” explains Hüning. “This is all pretty neat and one of the reasons I’ve been so outspoken about this project.”
The Linkerd community
“The Linkerd community is the best! Everyone is incredibly welcoming. The Slack channel is a great way to get valuable input and collaborate with others. You can literally find solutions to any kind of problem,” states Hüning. “In fact, you can find me there regularly. I’ve been active in the community and, because I enjoy jumping on any opportunity to help educate others, I was invited to become a Linkerd Ambassador along with some other fantastic Linkerd end users.”
Heavily invested in CNCF projects and open source
Over the years, Connect has become heavily invested in CNCF open source solutions and uses them across all their stacks. They use Emissary-Ingress (formerly Ambassador), Flagger for canary deployments, cert-manager for all certificate handling, Prometheus and Grafana for platform monitoring, NATS, and Rook and Ceph for storage in private cloud clusters.
Other non-CNCF open source projects include CockroachDB for all SQL databases, NGINX Ingress, HashiCorp Vault, and RabbitMQ.
Zero-trust doesn’t have to be hard
For companies like Connect, operating in the fintech industry, zero-trust is a requirement. But zero-trust is increasingly becoming a must-have across industries. In a microservices world, a firewall won’t do the trick anymore. While many cloud or enterprise architects are concerned about the complexity they might be adding to an already complex system, zero-trust doesn’t have to be hard.
Connect was able to mTLS all services within five months while minimizing the impact on developer productivity. Additionally, the platform metrics have proven to be a real time-saver when debugging and keeping a pulse on service health.
Connect chose Linkerd because the configuration is minimal while providing key features almost automatically. The features that were not yet available at the time, they helped build. “Today, there is no reason not to mTLS all your services,” said Hüning.