Member post by Anjali Udasi, Technical Writer, Zenduty and Shubham Srivastava, Developer Relations Engineer, Zenduty

Joel Studler and Ashan Senevirathne took the stage at KubeCon + CloudNativeCon Europe in Paris with their presentation, “From GitOps to Kubernetes Resource Model,” highlighting Swisscom’s automation journey in the 5G Core and reflecting the company’s evolution from telco to TechCo.

Their talk was truly compelling, sparking our interest in learning more about their experiences and journey at Swisscom.

Shubham Srivastava, who currently leads Developer Relations at Zenduty, had the pleasure of speaking with this dynamic duo: Joel, a DevOps Engineer and System Architect dedicated to building the next generation of mobile networking using cloud-native technologies, and Ashan, a Product Owner overseeing the design, implementation, and delivery of a cloud-native orchestration framework for the mobile organization. 

Building the Future of 5G with Cloud-Native Tech, presented by Ashan Senevirathne and Joel Studler from Swisscom

In this blog featuring Joel and Ashan, we peel back the layers of the telco world, the struggles of modernization, and the cutting-edge tools these minds put to use. This is a conversation you don’t want to miss!

Shubham: It’s great to chat with you both. We’d love to know what an average day looks like for each of you. You’re leading DevOps and reliability at a telco, with very tight error budgets and little room for failure. So what does that look like behind the scenes?

Ashan: For us, our main focus is on developing tooling capabilities for the upcoming 5G core technology, which we find applicable to other areas of the business as well. We put a lot of emphasis on community-driven initiatives. While our main focus is on Kubernetes environments, we also address the transition from legacy-based change management to cloud-native approaches, which requires a shift in organizational mindset.

Joel: My role involves handling technical interfaces with the product, and collaborating closely with Ashan on architecture and engineering. Our daily tasks involve building reliable tools and automation, predominantly through Kubernetes operators. We prioritize designing sustainable and efficient solutions while optimizing existing workloads. Testing and deployment typically occur on live or pre-production clusters.


Shubham: Ashan, you mentioned that half of the job isn’t migrating processes; it’s migrating people’s mindsets. Stemming from that, what do you think is the hardest part about maintaining and updating reliability and tooling in a telecom industry that’s typically viewed as archaic and laden with legacy processes?

Ashan: The biggest hurdle in the telecom industry is adapting to a more open and flexible approach to network management. Traditionally, telecom relies on vendor-provided “black box” software, making it difficult to maintain and update tools reliably.

But now we’re tackling this.

Joel:

The fact that we strategically decided to move forward with Kubernetes operators and Kubernetes concepts for automation also has a big impact on many other topics, like change management.

For instance: How do you implement a change? How do you plan with Kubernetes resources? You don’t control when the change happens; the operator rolls it out as part of its ongoing reconciliation.

It’s a dynamic system, sparking a range of questions that we’ll be discussing a lot in the near future. This shift impacts not just technology but also requires a cultural change within the organization. The company is focusing on education and demonstrations to promote this new way of working.
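The operator pattern Joel describes boils down to a reconciliation loop: you declare the desired state, and a controller converges the actual state toward it on its own schedule, which is exactly why you no longer control when a change lands. A minimal illustrative sketch of that loop (this is not Swisscom’s tooling; real operators are built with frameworks like controller-runtime, and all names here are hypothetical):

```python
# Minimal sketch of the reconciliation pattern behind Kubernetes operators.
# Real operators use client-go/controller-runtime; names here are illustrative.

def reconcile(desired: dict, actual: dict) -> list[str]:
    """Compare desired vs. actual state and return the actions needed to converge."""
    actions = []
    for key, want in desired.items():
        if actual.get(key) != want:
            actions.append(f"set {key}={want}")
    for key in actual.keys() - desired.keys():
        actions.append(f"delete {key}")
    return actions

# You declare intent; the controller decides *when* to act on the drift.
desired = {"replicas": 3, "image": "core:v2"}
actual = {"replicas": 2, "image": "core:v2", "debug": "on"}
print(reconcile(desired, actual))
```

The change-management questions Joel raises follow directly from this shape: the loop runs continuously, so "applying a change" means editing the desired state, not scheduling an action.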

Shubham: As an organization that’s so large and serves millions of people every day with mission-critical services, how do you handle the transition to Kubernetes? Specifically, how do you support Kubernetes and other open-source tools that enhance its capabilities? How do you vet open-source tools and new technologies from the CNCF ecosystem or elsewhere to ensure they’re stable and suitable for your organization?

Ashan: From a Kubernetes perspective, we use a vendor-provided distribution for our infrastructure. For deployment, we utilize Flux, along with the External Secrets Operator, cert-manager, and several other mature tools within the ecosystem.

For anything telco-specific, we often develop our solutions and have strong support internally to open-source these projects. This allows us to contribute to the community and encourage contributions from other operators, integrating telco-specific use cases into the Kubernetes ecosystem.

When selecting tools, we prioritize maturity and support from the community and other industry players over novelty. This ensures we choose reliable, well-supported tools rather than simply the latest trends.

Shubham: Joel, are there any tools that have caught your eye recently, tools that you’d love to play around with and are watching closely?

Joel: For the development flow, I’m interested in Microcks. It’s an API mocking framework, and its innovation lies in being usable both within your IDE and on a Kubernetes cluster.

We’re also exploring testing tools like Litmus for chaos engineering and Testkube, a testing wrapper.

For instance, we’re adopting cert-manager, but in a Mobile Core on-prem environment with black box applications, it’s challenging. We’re pushing vendors to ensure compatibility with cert-manager, despite their tendency to fork and maintain their own versions.

Ashan: Additionally, we’re looking into a project called Nephio, driven by the Linux Foundation. It’s designed for deploying and managing the 5G core in a cloud-native way. While we don’t use Nephio tooling directly, we adapt its framework and thinking. For instance, we’re contributing to and leveraging the SDC (Schema Driven Configuration) tool within the project.

Shubham: Are there any problems you’re facing right now where you’re waiting for a tool to come along and solve them? A hard problem you’d rather not build a solution for yourselves, and are hoping someone else tackles?

Ashan: The technical challenge we’re seeing is that we have these telco applications, and each is treated as an appliance. During the deployment and configuration phases of the lifecycle, we can do the deployment itself in a cloud-native way.

But what comes on top, at the appliance or configuration level, is done in a telco way. There’s a proprietary interface defined by the telco standards, and you need to apply configuration and define network services outside of the Kubernetes layer. To achieve this, you need workarounds on top: implementing custom operators or finding other ways to bring what’s done outside the Kubernetes layer back into it.

If there’s one ask, it would be to have this configuration done in a Kubernetes-native way: moving away from NETCONF-based files and toward a Kubernetes resource model. This shift would provide significant benefits, especially considering the time and effort we currently invest in making the configuration Kubernetes-native.
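The shift Ashan describes essentially means expressing network-function configuration as declarative Kubernetes resources instead of imperative NETCONF edits. A toy sketch of what such a resource might look like (all field names and the CRD group are invented for illustration; real KRM types would come from CRDs such as those the Nephio project defines):

```python
# Toy sketch: a network-function config expressed as a KRM-style resource
# rather than an imperative NETCONF edit. All field names are hypothetical.

def make_nf_config(name: str, namespace: str, dnn: str, slice_id: str) -> dict:
    """Build a Kubernetes-resource-shaped manifest for a network function."""
    return {
        "apiVersion": "example.org/v1alpha1",  # hypothetical CRD group/version
        "kind": "NetworkFunctionConfig",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"dnn": dnn, "sliceId": slice_id},
    }

manifest = make_nf_config("upf-config", "5g-core", "internet", "slice-a")
# A custom operator would then reconcile this spec against the vendor
# appliance, translating it to the proprietary interface behind the scenes.
print(manifest["kind"], manifest["spec"]["dnn"])
```

The point of the shape is that the spec becomes plannable, diffable, and GitOps-manageable like any other Kubernetes object, which is exactly the benefit Ashan is asking vendors for.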

Joel: Our biggest pain point right now is that we can’t handle our applications as true cloud-native citizens. The applications we receive from vendors are still treated like traditional hardware, with manual configurations akin to putting a server into a rack or setting up a bare metal appliance.

The mindset remains tied to the idea of a permanent system, like a Helm release, where changes are made directly on the running system. This approach prevents us from implementing practices like blue-green deployments, and even a simple redeployment becomes a huge effort due to the manual steps involved.

We believe that introducing a cloud-native configuration interface would simplify lifecycle management, updates, and configurations. 

Shubham: Observability must be crucial in your journey, especially as you ramp up. What does your observability framework look like? What metrics are you spending most of your time monitoring? We’d love to know more about how you handle monitoring and observability at Swisscom.

Joel: Currently, we use a standard observability stack with Prometheus in our clusters and Loki for logging. For centralized deployment, we use Thanos. Additionally, we consume an internal Observability-as-a-Service stack that any Swisscom team can use, built on the standard Prometheus-Grafana stack, which integrates well with our Incident Management tooling.

We focus on a minimal, relevant subset of metrics to ensure service health. The 5G core and applications are more complex due to their black-box nature, so we work closely with vendors to identify the right metrics. 

Ashan: We use HTTP-based requests to monitor golden signals and key performance indicators (KPIs). For example, in the 5G core we track how many users are attached, what the latency is, and some DNS metrics.

While some metrics are standard, others are more telco-specific, requiring vendor collaboration.
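Monitoring golden signals this way comes down to probing service endpoints and aggregating the raw samples into KPIs like success rate and tail latency. A minimal sketch of that aggregation step (the sample data is made up; this is not Swisscom’s metrics pipeline, which runs on Prometheus as described above):

```python
# Sketch: turning raw HTTP probe samples into golden-signal KPIs.
# Sample data is hypothetical; a real pipeline would query Prometheus.

def summarize(samples: list[dict]) -> dict:
    """Compute success rate and p95 latency from probe samples."""
    latencies = sorted(s["latency_ms"] for s in samples)
    ok = sum(1 for s in samples if s["status"] == 200)
    # Nearest-rank p95: index ceil(0.95 * n) - 1
    idx = max(0, -(-95 * len(latencies) // 100) - 1)
    return {
        "success_rate": ok / len(samples),
        "p95_latency_ms": latencies[idx],
    }

samples = [
    {"status": 200, "latency_ms": 12},
    {"status": 200, "latency_ms": 18},
    {"status": 500, "latency_ms": 250},
    {"status": 200, "latency_ms": 15},
]
print(summarize(samples))  # success_rate 0.75, p95 latency 250 ms
```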

Shubham: You mentioned a while ago that you’re spending a lot of effort bolstering the site reliability movement within your organization. SRE can be very demanding, as we all know. What’s the story at your org? How are you managing work-life balance, especially in a telco environment where nothing can go wrong?

Ashan: I can speak on behalf of the mobile organization at Swisscom. We have a significant IT side as well, but our focus here is on the mobile sector.

Throughout our SRE journey, we’ve learned that not all Google-defined site reliability engineering practices directly apply to the Telco space. Instead, we’ve shifted our focus to service reliability, defining specific services offered by our mobile organization.

For instance, 5G is infrastructure, but the service is mobile data—like users browsing YouTube. We start by defining these services, identifying the underlying resources, and establishing SLAs and SLOs for each service. From there, we implement best practices in release engineering, observability, reliability, and security.

In the 4G era, particularly with the Evolved Packet Core on virtual machines, we’ve heavily invested in these principles. As we transition to the 5G core, we will apply the same principles, but in a cloud-native way, simplifying processes. This convergence of SRE and cloud-native transformations is key to our approach in the 5G domain.

Joel: Another unique aspect of Swisscom’s approach is encouraging a cultural shift among our engineers. We emphasize that every decision in engineering or operations has a reliability impact. Encouraging individual responsibility and continuous improvement in operations is crucial, and having management that supports and encourages this mindset matters just as much. This cultural shift has had the most significant impact on our organization.

We’ve started defining SLOs and maintaining error budgets for our services, but we apply them selectively, not at every resource level. When moving to Kubernetes operators, many SRE concepts, such as reconciliation, are automated by the Kubernetes layer. This automation puts us on the right track, and we’re excited to see the benefits it will bring to the organization.
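The SLO and error-budget bookkeeping Joel mentions follows standard SRE arithmetic: the budget is whatever unreliability the SLO still permits over a window, and each incident spends some of it. A small sketch with illustrative numbers (not Swisscom’s actual targets):

```python
# Sketch: error-budget arithmetic for an availability SLO.
# All numbers are illustrative, not Swisscom's real targets.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent after some downtime."""
    budget = error_budget_minutes(slo, window_days)
    return 1 - downtime_minutes / budget

# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
print(error_budget_minutes(0.999))    # ≈ 43.2
print(budget_remaining(0.999, 10.0))  # ≈ 0.77 of the budget left
```

Applying this selectively, as Joel describes, means defining such budgets only for the services that matter, rather than for every resource.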

And that wraps up our conversation with Joel and Ashan! It’s always insightful to discuss observability, the demanding nature of SRE, and the innovative tools these reliability heroes are using to build products used by millions of people.

If you’re fascinated by reliability and the intricate process of recovering from downtime, check out our podcast, Incidentally Reliable, where veterans from Docker, Amazon, Walmart, and other industry-leading organizations share their experiences, challenges, and success stories from the cloud-native world.

Authors: 

Anjali Udasi (Technical Writer), Shubham Srivastava (Developer Relations Engineer)
