Guest post originally published on the Humanitec blog by Luca Galante
Platform engineering is the new cool kid on the block that everyone wants to be friends with. However, many are still confused where this new discipline comes from and how it differentiates from more established practices like DevOps and SRE. In this article, we provide some historical context and explain how they all relate to each other.
With the platform engineering movement picking up steam at an impressive pace over the last two years, there’s a lot of excitement as well as confusion around the definition of this new space and how it compares to the more established SRE and DevOps categories. On the Platform Engineering Slack we see many new members coming over from traditional SRE and DevOps roles. Some want a change in their career, some think platform engineering is the new hot thing they can leverage for a salary raise, some are confused what the difference to their current role is.
This blog post is aimed at giving an overview of the trend of platform engineering and hopefully bring some clarity around the definitions and differences between these terms. This is just one opinion so feel free to add to it, bash it or share your own in Slack or maybe directly at PlatformCon.
Evolution of DevOps
Let’s start from the beginning of cloud times, back in 2006 when Werner Vogels, CTO of Amazon, famously yelled on stage “you build it, you run it”. A new DevOps era was ushered into existence, together with AWS, and the convergence of developers and operations seemed the brave new world. The DevOps bible Accelerate described the 4 key DevOps metrics (lead time, deployment frequency, change failure rate and mean time to recovery or MTTR), which have since become an industry gold standard. Leading engineering organizations leveraged this new philosophy to develop, deliver and ship software faster and better than ever. Some were now able to deploy 100s or 1000s of times a day and deliver value to their customer at a speed that was unthinkable in the old throw-over-the-fence days.
Sounds great right? And it is. But only for a selected few.
While many organizations are advancing on their DevOps journey, studies like the State of DevOps Report by Puppet or Humanitec’s Benchmarking Study have shown that too many teams are still stuck in the middle and can’t cross what Humanitec calls the DevOps Mountain of Tears.
Why are these organizations stuck? Why can’t they enter the brave new world of DevOps? There can be a number of reasons of course. But a consistent theme that we see across industry studies, as well as in our Platform Engineering community, is that most of them are overwhelmed by the sheer complexity of their shiny new cloud native setups. Teams often embark on their cloud journeys thinking a microservice architecture running on Kubernetes will fix all their problems.
The reality is usually very different. Cognitive load on developers constantly increased during the last 10 years. And in most cases such migrations are massively underestimated, ending up in a setup that is very hard to maintain, with teams not having all the right skill sets in place to make the most of their brand new infrastructure. Simply shifting left and telling developers they now do DevOps is a recipe for disaster, which often results in more senior developers doing “shadow ops”, while the actual Infra or Ops team is completely overwhelmed with tickets.
The reality is usually very different and in most cases such migrations are massively underestimated, ending up in a setup that is very hard to maintain, with teams not having all the right skill sets in place to make the most of their brand new infrastructure. Simply shifting left and telling developers they now do DevOps is a recipe for disaster, which often results in more senior developers doing “shadow ops”, while the actual Infra or Ops team is completely overwhelmed with tickets.
Fake SRE
The SRE story is similar. Established and popularized by Google, this concept was sold to many engineering organizations as the dream culture everyone should aspire to have. And there is nothing against that at all. SLOs and error budgets are great metrics. There is nothing more valuable than having a reliable production environment with an uptime of 99,9%.
🔮 The widespread adoption of fake-SRE is going to be a disaster for organizations and the reputation of the IT industry, undoing a decade of good practices.
It will, however, enrich some unscrupulous IT outsourcing companies. 💸 #SRE #fakeSRE https://t.co/Cv2j6438p1— Matthew Skelton #BLM 💙🌻 (@matthewpskelton) August 26, 2021
But the reality often looks a bit different. What do you do if your quarterly error budget is already eaten up after two weeks? What happens when your SREs are constantly overworked and close to burnout because of too many unplanned night shifts? Suddenly, SRE can become a pretty restrictive function and your SRE team will think twice before they accept any further deployments.
Unlike at Google, SREs in most organizations do not have the capacity to constantly think about better ways to enable developer self-service, improve architecture and infrastructure tooling, while also establishing an observability and tracing setup. Most SRE teams are just trying to survive.
This leads pretty often to a very conservative mindset. Many SREs – for good reasons – see themselves as gatekeepers to prevent the next disaster.
We see many teams striving to implement best practices from elite engineering organizations, without taking into account the key differences between the respective setups and, especially, resources. The quote below from DevOps Topologies captures this anti-pattern quite well:
“The Ops engineers now get to call themselves SREs but little else has changed. Devs still throw software that is only ‘feature-complete’ over the wall to SREs. Software operability still suffers because Devs are no closer to actually running the software that they build, and the SREs still don’t have time to engage with Devs to fix problems when they arise.”
Platform teams to the rescue!
So how can you avoid such anti-patterns? You guessed it, you set up an internal platform team whose main objective is to build an Internal Developer Platform, or IDP. The book Team Topologies has become a standard reference on this and clearly shows how structuring an engineering organization to have dedicated platform teams that provide an IDP as a product to their customers, the developers, is the best way to overcome the fake SRE or shadow Ops anti-patterns.
This is backed up by the research and data in the Puppet’s State of DevOps report for both 2020 and 2021, which found a strong correlation between the increased performance of an engineering organization and the usage of internal platforms.
The core of platform engineering: product mindset!
This is something that most vendors of platform tooling like control planes tend not to tell you. They want you to believe using their product is the solution to all your problems: “Great, I don’t have to hire a PM for this, I can just outsource it to this tool!”.
But this is extremely unlikely to work. A functional Internal Developer Platform needs to be owned by an internal platform team that treats it as a product and keeps iterating on it, based on feedback from the rest of the engineering organization. If you want to dive deeper, check out this recent talk by Manuel Pais (co-author Team Topologies) on why you should treat your Platform as a Product at PlatformCon 2022.
Making sure you enforce a product mindset when setting up your platform engineering team is key for long term success. What you definitely don’t want your platform team to become is a glorified help desk that is constantly trying to keep up with a bunch of support operations tickets.
This was formulated very clearly in the dedicated section on the Thoughtworks Tech Radar:
“Using a product-thinking approach can help you clarify what each of your internal platforms should provide, depending on its customers. Companies that put their platform teams behind a ticketing system like an old-school operations silo find the same disadvantages of misaligned prioritization: slow feedback and response, resource allocation contention and other well-known problems caused by the silo. We’ve also seen several new tools and integration patterns for teams and technologies emerge, allowing more effective partitioning of both.”
This also means your platform team should build out a dedicated product management functionality. You should treat this team as a full blown product team, not just a new operations hybrid with a trendier name.
Here’s a handful of guiding principles we gathered from some of the top performing engineering organizations that have built successful platform cultures
- Optimize for speed, not for cost (Courtney Kissler)
- Optimize for better developer experience and reduction of cognitive load
- Do your user research on what a great developer experience means to your dev team; otherwise you will end up like this
- Come up with a good balance of the cognitive load tradeoff and find the right degree of abstraction for your team (see Humanitec’s DevOps Benchmarking Study 2021)
- Provide golden paths, not cages
- Evangelize your platform internally. Train and onboard users like you’d do for any other product. For more inspiration check Galo Navarro’s recent talk Salesman tricks for the platform engineer.
The goal should always be to build an Internal Developer Platform that serves the needs of your internal customers, the engineers. The results will hopefully be more stability and less configuration drift (SRE, you did it!), while also creating a true you build, you run it culture (DevOps, yey!).
This is really exciting, if you can execute on it well. There’s a constantly growing amount of conversations and threads around these topics in the Platform Engineering community. I was just speaking to a platform PM the other day who was quite enthusiastic about it. In his words, he “finally found a space and community that shares this product mindset when building platforms. Most other SRE or traditional DevOps communities tend to have a much narrower focus.”
I am particularly excited about the latest initiative by the community, the first PlatformCon earlier in June this year. It was great to see how more than 6.000 platform practitioners, DevOps and SRE folks, and product managers engaged with the speakers and with one another. We had more than 70 speakers, among them thought leaders like Gregor Hohpe (author Cloud Strategy), Nicki Watt (OpenCredo) and Nigel Kersten (Puppet). We saw great real-life implementation examples from the Netflix team, nesto or Frontside to name just a few. We can’t wait for PlatformCon23!