On the Value of Platform Engineering

I have been working as a Platform Engineer for a few years now. I like my work, I am passionate about it. Aside from the toil… unless I can automate it, then I am happy too. In this article, I’d like to discuss the merits that my particular specialization brings to the table. The Platform Engineer role is still somewhat new, so there is no wide agreement on what the role entails and thereby what value it provides. I wouldn’t be surprised if you asked three people with opinions on the matter and got four answers - at least in the details. Further, as Platform Engineering is close to infrastructure, its value is not as immediately and obviously visible as that of, say, Frontend Engineering.

Either way, my real motivation is to clarify my own thoughts on the matter and writing helps me in the process.

Since I am working exclusively within cloud infrastructures these days, all my arguments and mentions of infrastructure resources should be seen in that context, although I guess many would be transferable to more classical data centers, or even broader scopes.

Before I provide my argument for value, I think it’s prudent to first take a short excursion into the depth of the history of the interwebs. Not only for fun, but also to paint a picture of the context that the value I am promising will emerge from. Of course, I’ll be speaking from my personal experience, so your mileage may vary …

A Short History of Cloud Computing

Cloud computing is the on-demand availability of computer system resources, especially data storage (cloud storage) and computing power, without direct active management by the user - Wikipedia: Cloud computing

With the advent of modern Cloud Computing one major thing changed in the computing landscape: Hardware now had an on-demand API. Actually, not hardware, but resources that were previously only expressed in hardware. That did not happen overnight; there was no first provider that used the word Cloud and set a new standard for all (even if some might claim that). It was, and still is, a process driven by necessity and demand. It’s also not one process, but multiple processes running concurrently, moving somewhat in the same direction.

Two of the causal concepts or technologies that I think are particularly important in the context of platform engineering are:

Virtualization: While the concept itself is decades old, and was productized for edge cases and large-scale enterprise solutions long ago, the first implementation that significantly impacted the larger market was, in my opinion, the virtual server. Virtual servers allowed data center owners to separate their expensive hardware into smaller chunks, which could be used by themselves and, especially, sold to others much more easily. This allowed more efficient utilization of available resources. But even more important: With the ability to create multiple virtual servers within a physical server, the next logical step was, of course, to automate that process and then, further, to instrument that automation. The infrastructure API was born. And it goes way beyond managing mere virtual servers: network firewalls, CDNs, databases, etc. - all virtual, all available via API.

Containerization (& Orchestration): In a strict sense, containerization is a specific implementation of virtualization. I still think it deserves its own paragraph, as containerization, like no other single technology, builds the bridge between infrastructure providers and infrastructure users. Encapsulating an application in a container allowed both parties to agree on a common denominator. As long as it fits in a container, we can run it. And this is where orchestration comes in, as the infrastructure perspective on containerization: Running standardized containers at scale in isolation.

New Paradigms

The widespread availability of the above technologies (and other factors, I am sure) triggered a lot of rethinking and led to a whole host of new perspectives and subsequent businesses arising. Again, I’d like to highlight a few which I consider especially impactful and relevant for context:

Micro Services: Micro Services are an entirely different perspective on application design and implementation. They make you think of even the most complex application as a set of distinct, potentially interconnected functionalities. This mindset has immediate consequences not only for implementation (build “simplistic” Micro Services), but also for organization (ownership of well-defined parts of the whole), and, of course, for deployment and execution: where monolithic designs often came with a long list of highly specific requirements towards the OS and the hardware (resources), Micro Services encapsulated in containers can work in much more homogeneous infrastructure, scale not as a whole but per service (part), fail individually with a smaller blast radius and so on. That doesn’t mean every application is (or should be) designed from the get-go as a set of Micro Services, but it creates the mindset, drives organization and fosters standardization nonetheless.

Pets vs Cattle: Back in the day, servers used to have names and were individually and lovingly configured. Having worked on bare metal with physical machines that I personally put a sticker with a name on before shoving them into the rack, I am well aware of the Pets that can be created. With access to APIs that provide virtual servers, or any other resource, there was no way or need to keep up with that. Infrastructure users can now buy resources entirely detached from physical things. On-demand and auto-scaling. As a consequence, availability of resources is of far less concern (they are just available); we think in stacks (of resources) and not individual machines. Resources are numbers. Cattle, not Pets. Read more from Randy Bias, who summarized the analogy perfectly.
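To make the Cattle mindset a bit more tangible, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only (the `Instance` type, the `i-0001` style IDs, the `reconcile` function); it does not mirror any real provider API. The point is merely that an unhealthy instance is not nursed back to health, but dropped and replaced by an anonymous new one until the desired count is reached.

```python
from __future__ import annotations

import itertools
from dataclasses import dataclass

# Hypothetical ID counter: cattle get numbers, not names.
_ids = itertools.count(1)


@dataclass
class Instance:
    """An anonymous, interchangeable server (illustrative only)."""
    instance_id: str
    healthy: bool = True


def new_instance() -> Instance:
    """Stand-in for an API call that provisions a fresh virtual server."""
    return Instance(f"i-{next(_ids):04d}")


def reconcile(fleet: list[Instance], desired_count: int) -> list[Instance]:
    """Drop unhealthy instances and provision replacements.

    Nothing is repaired by hand: broken instances are discarded and the
    fleet is topped up (or trimmed) to the desired size.
    """
    survivors = [i for i in fleet if i.healthy][:desired_count]
    while len(survivors) < desired_count:
        survivors.append(new_instance())
    return survivors


if __name__ == "__main__":
    fleet = [new_instance() for _ in range(3)]
    fleet[1].healthy = False  # one instance breaks
    fleet = reconcile(fleet, desired_count=3)
    print([i.instance_id for i in fleet])
```

Note that the code never asks which machine broke or why; it only cares that the stack has the right number of healthy members.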

Infrastructure as Code: Applications are complicated. Operating systems are complicated. When both come together it becomes even more complicated. There used to be a time when most people, myself included, configured systems manually. That resulted in deep and often unique knowledge about the systems. Unique, as in: low bus factor, as in: Nobody but one person knows about the system. The answer to that was configuration management, which allowed the maintainers to describe the whole setup in a replicable, testable, version-controlled way that doubles as standardized documentation. With that system, configurations can be easily re-created (comparatively speaking), the human error factor is massively reduced and any setup is understood in short order by others (comparatively speaking). Infrastructure as Code (IaC) does the same thing for, well, infrastructure resources. Instead of manually provisioning / configuring / ordering the resources that are required for a specific service or application, they can now be described in code, which serves as standardized documentation, is checked into version control, and is code reviewable, testable and replicable.
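As a rough illustration of the idea, here is a minimal sketch in Python. It is not how any particular IaC tool works internally; the `Resource` type, the resource names and the `plan` function are assumptions made up for this example. It only shows the core principle: the desired infrastructure is plain, reviewable data, and the required actions are computed by comparing it with what currently exists.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A hypothetical infrastructure resource, described as plain data."""
    name: str
    kind: str  # e.g. "vm" or "database" (illustrative values)
    size: str


def plan(desired: list[Resource], current: list[Resource]) -> dict[str, list[str]]:
    """Compare the desired state (from code) with the current state.

    Returns the names of resources to create, update and delete.
    """
    desired_by_name = {r.name: r for r in desired}
    current_by_name = {r.name: r for r in current}
    return {
        "create": sorted(n for n in desired_by_name if n not in current_by_name),
        "update": sorted(
            n for n, r in desired_by_name.items()
            if n in current_by_name and current_by_name[n] != r
        ),
        "delete": sorted(n for n in current_by_name if n not in desired_by_name),
    }


# The desired state lives in version control and can be code reviewed.
DESIRED = [
    Resource("web", "vm", "small"),
    Resource("orders-db", "database", "medium"),
]

# What actually exists right now (in reality this would come from a provider API).
CURRENT = [
    Resource("web", "vm", "large"),
    Resource("old-cache", "vm", "small"),
]

if __name__ == "__main__":
    print(plan(DESIRED, CURRENT))
    # {'create': ['orders-db'], 'update': ['web'], 'delete': ['old-cache']}
```

Because the description is just data, the same definition can be applied again and again, which is what makes the setup replicable and testable.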

As a Service: As infrastructure resource automation became ever more commonplace, things that run on said infrastructure also gained easy access to a super-power: scalability. What that means is that you only need to pay for the resources that you need. Or from the business perspective: You pay only for what you actually sell. Roughly speaking. With that in mind, it’s easy to understand why these days just about every functionality can be bought as a service. Think of it as a supply chain, with Infrastructure as a Service (IaaS) providing the foundation. That enables Platform as a Service (PaaS) providers to build their services upon it - like compute runtimes, databases, CDNs, etc. With that, Software as a Service (SaaS) builders can concentrate exclusively on their domain: building software, not deeply understanding the infrastructure layer. And everybody can grow or shrink as demand requires. Note that there are 75 as a Service types at the time of writing listed on Wikipedia, so I left out a few. Point is: The market changed. Concerns like hardware, operating system and service operations can now be outsourced, bought on-demand.

All these paradigms are deeply entangled. You can consider as a Service a result of Micro Services or the other way around. Also, Pets vs Cattle would hardly be possible without Infrastructure as Code, and the former created the need for the latter.

What is Platform Engineering?

Before I can address the value I argue Platform Engineering brings to the table, I first need to clarify a bit what I mean - or don’t mean - by that. Also mind that I am of the opinion that roles are like hats: You wear them as needed. Some hats you wear more often, some you like better than others, some you stash away and hope you never ever have to take them out again. Roles are not clear-cut, and roles are not bound to job titles either.

Platforming

I think it is safe to say that Platform Engineers are involved in Platforms. Since the term Platform is rather vague, I will try to explain what I mean by that - within the scope of this article. Let’s start with: a Platform is a well-defined interface that simplifies processes on and access to resources.

In the Platform Engineering world, these resources are, of course, infrastructure resources of one type or another. From a software developer perspective the Facade design pattern is the closest approximation of key properties of a Platform I can think of. I would argue a Platform consists of one or multiple “service facades” combined with their interface specifications and documentation. All of these are required, as the intention of a platform is to automate common use-cases.
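To illustrate what I mean by a “service facade”, here is a minimal sketch in Python. All names in it are hypothetical and invented for this example: `ComputeApi`, `DatabaseApi` and `DnsApi` stand in for low-level infrastructure APIs, and `WebAppPlatform` is the facade with a single, documented entry point for one common use-case. It is a sketch of the pattern under these assumptions, not a reference implementation of any real platform.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# --- Hypothetical low-level infrastructure APIs (stand-ins, not real SDKs) ---

class ComputeApi:
    def create_container_service(self, name: str, image: str, replicas: int) -> str:
        return f"service/{name}"


class DatabaseApi:
    def create_database(self, name: str, engine: str) -> str:
        return f"db/{name}"


class DnsApi:
    def create_record(self, hostname: str, target: str) -> str:
        return f"dns/{hostname}"


# --- The facade: one simplified interface for a common use-case ---

@dataclass
class WebAppPlatform:
    compute: ComputeApi = field(default_factory=ComputeApi)
    database: DatabaseApi = field(default_factory=DatabaseApi)
    dns: DnsApi = field(default_factory=DnsApi)

    def deploy_web_app(self, name: str, image: str, with_database: bool = False) -> list[str]:
        """Deploy a web application and return the IDs of the created resources.

        The caller states *what* is needed; replica counts, database
        engine and DNS naming are standardized inside the platform.
        """
        created = [self.compute.create_container_service(name, image, replicas=2)]
        if with_database:
            created.append(self.database.create_database(name, engine="postgres"))
        created.append(self.dns.create_record(f"{name}.apps.example", target=created[0]))
        return created


if __name__ == "__main__":
    platform = WebAppPlatform()
    print(platform.deploy_web_app("shop", "registry.example/shop:1.0", with_database=True))
    # ['service/shop', 'db/shop', 'dns/shop.apps.example']
```

The application developer only ever sees `deploy_web_app` and its documentation; which resources are created behind it, and how, is the platform’s concern.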

Platforms are mostly transparent to the user (as in: you don’t know what’s happening inside), but I am not sure whether or not this is a requirement - though it certainly is a good idea in most cases.

By this definition, GitHub would be a Platform. AWS would be a Platform - or maybe rather a Platform of Platforms. But also, any individual, in-house created solution that provides a simplified interface to manage the infrastructure resources that your use-case demands is a Platform.

Platforming then, is the process of identifying common use-cases (or patterns), modelling interfaces that describe them, providing automated solutions that implement them, and documenting it so it can be used with the least amount of manual interaction.

Platform Engineering is then the discipline that is concerned with Platforming.

Compared to Other Roles

While Platform Engineering (PE) is a new role, it did not come from nothing, but emerged as a specialization from other existing roles and responsibilities - like all roles. I’d like to highlight a few which I think influence PE greatly:

As the closest “sibling” of the PE role I would consider the DevOps Engineer (DE), who is concerned with providing (cloud infrastructure based) solutions for specific applications. Or so I think - DevOps is a lifestyle, after all. What a PE does is pretty close to that. The main difference for me would be perspective and resulting scale. Whereas the DE advocates for “their” application(s), the PE is concerned with many or all applications. DE solves the special use-case (build infrastructure that a set of specific applications run in), PE solves the common use-case (build a platform that all / most applications run in). DE is more interested in specifics, PE is more interested in commonalities. From my experience, both are usually needed, the borders are very blurry and often the same people wear both hats.

Probably involved in the origins of the Platform Engineer role is the Infrastructure Engineer (IE) role, which is concerned with building, deploying and maintaining, well, infrastructures. That role borders on System Administrator, Network Engineer and likely also DevOps on another side. It extends beyond the realm of cloud, as it is also concerned with data centers. Within this article, only the cloud scope is relevant. Within that scope the term is, I think, Cloud (Infrastructure) Engineer, so I am referring to that. From here PE took, of course, working with (cloud) infrastructure and generally thinking in infrastructure architectures. The main difference I see is that IE has strong concerns in operability and reliability, whereas PE has a focus on accessibility.

Also playing a role is the Backend Engineer (BE), who is concerned with the data access layer of applications. It’s a broad field, with lots of sub-specializations, and certainly bordering on DevOps. One of the core concerns is designing and implementing APIs, including working with various data sources, queues and all that. PE does that as well, albeit with different purposes in mind. BEs are “customers” (albeit in-company ones) of PE-created platforms and build their applications upon them.

As I said, the borders are hard to draw, and there are likely more roles (e.g. Site Reliability Engineering) I’d like to compare with, but I have to stop at some point and hope this painted a somewhat clearer picture.

The Value Proposition

Now, finally to the argument I want to make: The Value of Platform Engineering.

Taking the changes in the technology landscape into account, as well as the resulting new paradigms, one thing becomes clear: Complexity in the application ecosphere is on the rise. This is not caused by increased complexity in using the infrastructure side, nor the application development side. Things in Infrastructure actually got much easier, with Infrastructure as Code et al. Application development, from my perspective, also took huge strides towards ease of use.

So what is it then? Well, it is the fault of the great success that resulted from that: The increased number of possibilities and opportunities that arose from the changes I highlighted (and others). Everybody now has easy access to scalability? Well, that also means it is now a concern for everybody. And this is why Platform Engineering plays a key role: it helps make those opportunities easily accessible by reducing complexity for the user, the application developer.

In more detail:

The Standard PaaS Pitch

The advantages of PaaS are primarily that it allows for higher-level programming with dramatically reduced complexity - Wikipedia: Platform as a Service

First of all: I think that should be the general as a Service pitch, not specifically for PaaS. Like: The advantages of as a Service are primarily that it allows for higher-level use with dramatically reduced complexity - you know, sending an e-mail, streaming movies. All much better interfaces with dramatic reduction in complexity - considering the alternatives.

Anyway, this is what Platform Engineering can do: create higher-level interfaces for infrastructure resources that are designed so that they fit the use-case(s) and then implement them. Just not for any end-users, as the above generic as a Service pitch implies, but for application developers. Standardization comes as a free bonus.

Platform Engineers, of course, do not work exclusively for PaaS providers (there aren’t that many around); they build platforms all over. Most PaaS providers address mass markets. Due to size, security, legal, legacy or many other reasons there are many, many companies that don’t fit in that offering. They still need infrastructure to run the services that their developers create.

DevOps Engineering and Platform Engineering bridge the gap between developers, who are specialized in the application layer, and infrastructure providers, who provide the resources to run the application on. DevOps Engineering provides solutions to run a small set of specialized services in cloud infrastructures. Platform Engineering provides solutions in the form of platforms that run more generalized services in cloud infrastructures. The borderlines in between are blurry and mostly a question of size and growth.

Either way: this allows developers to focus on development of applications, which they are great at. In my coloured opinion, you should of course start with platforming as early as possible, unless you don’t expect infrastructure growth or change.

Value: Lasting, high-paced development progress

Counteract Technical Debt

Technical debt increases when things change or grow. As in: Always, unless you stand still. The question is not how to fully stop it but how to slow it down, so that it becomes manageable and does not increase until you are forced to stand still.

There are many contributing factors to the increase of technical debt. Of course there are business-side causes that seize priorities, leaving little or no time to align infrastructure or software design to counteract the build-up of technical debt. Then, there are structural or architectural causes that contribute. Patterns like Micro Services also arose to address that from an architectural side.

One major factor contributing to fast growth of technical debt is having too many solutions for the same problem. As in: Every other application has its own unique deployment pipeline or service runtime or database setup or you-name-it. The result is an ever-growing zoo of solutions and patterns. Each might sound like a great idea in the moment, as it solves the problem at hand, but together they become unmanageable chaos in the future: highly fragile, and there be dragons that you cannot touch without great risk.

Platform Engineering, at least, counteracts this last factor, and to different degrees others, by providing a standardized framework, a common denominator for services. It also enforces clear boundaries and interfaces that allow you to understand, grow and change the application landscape much more easily - in the now and in the future.

Closely related to that is the build-up of legacy over time. This also translates to technical debt that needs to be paid at some point. While Platform Engineering does not, cannot, directly tackle legacy, it confines it to the agreed-upon boundaries that make up a service or an application.

Lastly, Platform Engineering fosters modern paradigms like the aforementioned Micro Services, which offset technical debt on their own. Platform Engineering has an enabling and supporting role here.

Value: Agility for change / Uninterrupted growth

Ready for Change

The only constant is change. This was already known 2500 years ago and still holds true, even for modern day application infrastructures. Still, it keeps surprising us when it comes. In the last 20 years of working in that ecosphere, I’ve never encountered any system that did not change over time. Change comes in many forms. From small things, like new features that improve the value of an application, up to foundational change, like migrating from physical data centers into cloud infrastructure or rebuilding entire applications from scratch. Either way: Things change, you need to be ready for it, or pay a high bill when it comes due. Which it will.

You guessed it, there is a common theme here: Platform Engineering to the rescue. How? It provides a standardized environment to deploy and run your applications. Emphasis is on the word standardized. Think of it like this: Moving a hundred applications that each have their own way to deploy and run their services is very hard, daunting even. Moving a hundred applications that have a common way to deploy and run their services is much less hard. A Platform of course provides these shared commonalities.
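The “hundred applications” argument can be sketched in a few lines of Python. The names here (`AppSpec`, `Target`, `DataCenter`, `Cloud`, `deploy_all`) are hypothetical and only serve the illustration; real migrations are of course messier. The sketch assumes that every application is described by the same specification and deployed through the same interface, so that the migration boils down to swapping the target in one place.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class AppSpec:
    """The common, standardized description every application provides."""
    name: str
    image: str


class Target(Protocol):
    """Anything the platform can deploy to (hypothetical interface)."""
    def deploy(self, app: AppSpec) -> str: ...


class DataCenter:
    def deploy(self, app: AppSpec) -> str:
        return f"datacenter:{app.name}"


class Cloud:
    def deploy(self, app: AppSpec) -> str:
        return f"cloud:{app.name}"


def deploy_all(apps: list[AppSpec], target: Target) -> list[str]:
    """Deploy every application the same way; only the target differs."""
    return [target.deploy(app) for app in apps]


# A hundred applications sharing one way to deploy and run.
APPS = [AppSpec(f"app-{i}", f"registry.example/app-{i}:1.0") for i in range(100)]

if __name__ == "__main__":
    before = deploy_all(APPS, DataCenter())
    after = deploy_all(APPS, Cloud())  # the "migration" is one changed argument
    print(before[0], "->", after[0])   # datacenter:app-0 -> cloud:app-0
```

Without the shared `AppSpec` and `Target`, the equivalent would be a hundred bespoke deployment scripts, each to be understood and rewritten individually.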

That also implies: Platform early on. Maybe not when building an MVP, not even I would advocate for that, but certainly early enough so that standardization is not so painful that you might omit it even longer. Finding that right moment is not easy, hence: Platform early on.

Value: Flexibility (in application infrastructure translates to flexibility in business)

The End

Thanks for reading. I am still iterating on the idea, and there is certainly more to say on this topic than I captured here - but I guess you gotta release at some point. I also acknowledge that my focus in this article is mainly on the “Platforming as building service facades” aspect, though I am aware that is not the full story of Platform Engineering. You can work all year with Kubernetes manifests, Elasticsearch logging clusters or what-have-you without facading (is that a word?) anything, and still be a top-notch Platform Engineer.

Anyhow, thinking about this topic by writing about it helped me a lot; I hope it gave you food for thought as well.

Ulrich Kautz Blog