Platform Engineering: Development at Scale

The topic of Platform Engineering has become all the rage in the last year. Good! It is about time. Software development is at the heart of an ever-growing number of organizations, and platforms are a necessary response to anchor best practices and standardize resource use – problems that get harder in proportion to the complexity that comes with organization and software growth.

Most of the discussions around the topic over the last year and a half have passed me by. I do have a good excuse, for I had and have the privilege to be part of one of the largest infrastructures on the planet, which received all my attention instead. Needless to say, I learned a tremendous amount about platforming at scale. This of course changed my perspective since I last wrote about this topic in March ’21.

I recently picked up the ball again and started reading a lot of newly published articles about what people think that platforms are and what platform engineering does. I found that some are arguing that platform engineering is an evolution of DevOps. Even that platform engineering is the practice while DevOps is the philosophy. That made me a bit twitchy, because building platforms has been at the heart of what I was doing for the last two decades or so – the term DevOps has certainly not been around for that long.

On the other hand, I certainly support pitching the platform model to organizations that already bought into the DevOps paradigm. I can also see where those arguments are coming from and how DevOps and platform engineering intersect. However, I think conflating both terms in such a way limits your thinking unnecessarily. To me, platform engineering is first and foremost a practice to enable software development at scale. In that sense, platforms are a tool. How and why you wield that tool is a question that can have many answers. One answer could be that it fits into your DevOps organization, but that is not the only one – you can gain value out of platform engineering in any case.

What are Platforms?

I want you to think about the platform concept without any preconceived notions about DevOps or anything else for a moment. Consider the following entirely non-technical example from even before the interwebs was around: a travel agency of the olden days. Back when phones had cables and dials – and no displays. You know, shortly after the dinosaurs roamed the earth. It was pretty hard to go on vacation anywhere outside of your own backyard. Just as today, you needed to book a flight, find and book a place to stay at the destination, ensure access to local transport and plan activities outside of loafing by the pool. Now imagine doing that pre-internet in a country whose language you do not speak. Imagine even just calling all the airlines directly and asking them for flight times and prices. Yes, a nightmare. Travel agencies are an excellent, if mostly obsolete, example of a user-facing platform. The complexity of booking all the components and thinking of all the things, and then orchestrating everything so that it fits perfectly together, was reduced to a simple conversation with an agent and maybe looking through a couple of catalogs. This is an example of a platform. Not a platform created by platform engineers, but a platform nonetheless. The example shows the first fundamental property of a platform: a platform reduces the complexity of underlying systems. Basically: It makes complicated things simple to use.

Platform reduces the complexity of underlying systems

You probably noted that I threw in the term systems. I will also soon use the related term service. Let me quickly explain what these two terms mean in this context: A service is “a single, self-contained unit of functionality”. A service implements the functionality itself. In the above example flying is a service that airlines provide. A system is a collection of components that could be platforms, services or systems themselves. The travel agency above would be a system and a platform. Each airline would be a system as well.

With that out of the way consider another fundamental platform aspect that the above example describes: the travel agency cherry picks functionality (services / systems) from the airlines, the hotels, the local transport and so forth, and offers this choice-set to their own client. More so, the agency re-packages that functionality in a way that their client may not even be aware who provides it while combining it (ideally) seamlessly with other functionality from other providers. This is done with the goal to make booking a vacation an easily accessible experience. In a sentence, a platform is a higher level system that refines one or more lower level systems. You can think of a hierarchical structure that has higher and lower levels. The platform is at the top and the other systems are below it. When interacting with the platform you only see the top layer (i.e. the travel agency) and the lower layers are hidden.

Platform is a higher level system that refines one or more lower level systems

By cherry picking functionality another fundamental aspect of platforms reveals itself: a platform serves a limited set of specialized use-cases. It has to, because if you could do the same things with a platform as with all the systems beneath it, then the platform would be exactly as complicated as the combined underlying systems and thereby entirely redundant.

In the context of the above example: A travel agency would not be the place to, say, book cargo flights to transport goods from A to B – even though the airlines that the travel agency uses also offer cargo flights.

Lastly – and this is not always fully possible, but always highly recommended: A platform abstracts the implementation behind an isolated, well-defined interface. That means that changes to the underlying systems have (ideally) no impact on the exposed interface of the platform itself. Again, using the example of the travel agency, this would mean that if from tomorrow on you no longer traveled by airplane, but by zeppelin or dragon or something, then the way you book your travel would still be the same: you talk to an agent, tell them where you want to go on what dates and they take it from there. There are some asterisks to that, because it is not always possible to fully isolate the platform interface from underlying systems (e.g. zeppelins and dragons don’t offer 1st class, so you cannot book that anymore). More on that in the benefits topic below.

Platform abstracts the implementation behind an isolated, well-defined interface
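
To make the isolated interface a little more concrete, here is a minimal Python sketch of the travel agency. All names in it (TravelAgency, Transport, Airline, Zeppelin, book, reserve) are made up for illustration and do not describe any real system. The only point is that the client calls book() in exactly the same way, no matter which lower level system sits underneath.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Booking:
    destination: str
    date: str
    itinerary: str


class Transport(Protocol):
    """A lower level system hidden behind the platform (hypothetical)."""

    def reserve(self, destination: str, date: str) -> str: ...


class Airline:
    def reserve(self, destination: str, date: str) -> str:
        return f"flight to {destination} on {date}"


class Zeppelin:
    def reserve(self, destination: str, date: str) -> str:
        return f"zeppelin to {destination} on {date}"


class TravelAgency:
    """The platform: clients only ever talk to book()."""

    def __init__(self, transport: Transport) -> None:
        # Which transport is used is an implementation detail.
        self._transport = transport

    def book(self, destination: str, date: str) -> Booking:
        itinerary = self._transport.reserve(destination, date)
        return Booking(destination, date, itinerary)


if __name__ == "__main__":
    # Swapping the underlying system does not change how the client books.
    for transport in (Airline(), Zeppelin()):
        print(TravelAgency(transport).book("Lisbon", "2024-06-01"))
```

In this sketch the asterisk from above would show up as soon as book() needed a parameter like a seat class that only some transports can honor: that is the moment an implementation detail starts leaking into the interface.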

As I wrote earlier, platforms can exist entirely outside of the DevOps paradigm. At least the above properties do not require it.

What is Platform Engineering?

While the platform model transcends industries, the term platform engineering is – as far as I know – exclusively used in the wild world of software engineering. So the above example of a pre-internet platform is valid, but it is not descriptive of what platform engineers do. To understand what that is, you have to further narrow down the realm of software platforms: any user-facing platform – from social media to online travel agencies – is usually out of scope. What I have observed while working on platforms in companies of various sizes – including an early Platform-as-a-Service (PaaS) – is that platform engineers almost exclusively create developer-facing platforms. To put it in a single sentence: platform engineers design, build and maintain platforms that facilitate development and delivery of software.

Platform engineers design, build and maintain platforms that facilitate development and delivery of software

This by no means implies that there is a single type of platform that every platform engineer builds. You’ll find mobile app development platforms, web application development platforms, machine learning data platforms, and much more. Platforms come in all shapes and sizes and each is designed to make a different use-case for the specific context of an organization easier to tackle. In larger structures you will see tiers of platforms forming: higher level platforms that build on lower level platforms. In this context developer-facing can equally refer to the developer of the end-product as well as any other (refinement) step in the value chain – like developers of a higher level platform. Platforms are also almost never static: they change and evolve along with the changing and evolving needs of the business.

So what does that mean in practical terms? As a platform engineer you need to take on two perspectives. First, of course, the perspective of the developer. To do that you need to understand how the developers in your organization work. There may even be multiple contexts, like the day-to-day of the mobile app developer, who faces rather different problems than the web application developer. It is up to you to find out whether there are common denominators. In general you need to have a lot of conversations and ask a lot of questions – and listen! What tools do they use? What languages do they write in? Do they have a local development environment? How do they currently deploy? How long does that take? How are automated tests triggered? What business goals shape all of that? And so forth. Gathering answers to all these questions gives you the internal context of what is being done and why. A good platform engineer understands the needs, wants and pains of the developers. This is the first half of the equation.

The other half is being informed about what is going on in the field. Be assured that you are not the first person facing, well, almost any problem. Make use of the wisdom of the domain. Learn how others have tackled the same issue – and have proven their approach right by thousands and thousands of hours of hardening in production. Read about strategies and patterns that aid you in framing the problems and delivering working solutions. This is the external context. A good platform engineer must have knowledge of current industry best practices. Combining both contexts is what makes the job interesting.

A warning here: Platforms are a great tool to introduce and propagate standards. However, if this is misused to steamroll and force-feed even the greatest practices without involving the very developers that are affected, then chances are adoption will wither and the rollout will fail. Been there, done that. Big mistake. Tact, common sense and pragmatism are more useful values for platform engineers than adherence to the book, idealism or zeal. The most successful platform rollouts I’ve seen were not the technically most sophisticated or the most optimized solutions, but those that had the best feedback loops to check in with the users (the developers) and make their transition the easiest. Platform engineers must advocate for the interests of developers. Without their support and buy-in, all effort is for naught. Besides, it is so much easier to iterate the platform later towards better practices than to attempt to start out with the perfect setup while fighting the developers’ expectations.

Platform engineers must advocate for the interests of developers

The other perspective that platform engineers need to take on is the infrastructure side. As with any production IT system, efficiency of resource use and resilience towards outages are key concerns. Since platforms are multi-tenant systems by design, isolation of tenants – for both security and accounting reasons – needs to have strong guarantees. Scalability and monitoring go hand-in-hand and are essential for any modern platform. All of this needs to be optimized, which often leads to hard coupling of the implementation with the underlying lower (infrastructure) systems. And here is where it becomes hard: you have to balance optimization with the flexibility to exchange whole underlying systems. As a platform engineer you usually work upon cloud infrastructure – whether directly or indirectly. The cloud is constantly improving and changing as well, so you need to be able to adopt new technologies and trends quickly. After all, what good is it to squeeze out 5% more performance at the same cost when migrating to the next generation gives you 50% more performance for the same price? As a platform engineer you need to be informed of the latest developments in the underlying infrastructure, so that you can leverage opportunities when they come along. Then again, this balance shifts towards optimization when your infrastructure costs make up most of your total expenses. This is certainly the case for very large scale operations, where even shaving off fractions of a percent in resource consumption can translate to millions in annual opex.

Platform engineers need to be informed of the latest developments in the underlying infrastructure

By the way: While the role of platform engineering is a new one, the same activity and context was around way before that: I am thinking of web hosting. The value of web hosting was to provide the ability to serve a website that was built by web developers, with very little effort. At least much less effort than is required to maintain your own hosting infrastructure. The people that built those web hosting platforms were not called platform engineers, but they pioneered the field (or maybe I am just nostalgic).

What is the Value of Platforms?

Again, I ask you to suspend any assumptions about DevOps and platform engineering and just think about what value platforms independently provide.

I’ve already tried to convey what I consider to be the primary value: Platforms simplify access to complex functionality. However, that is far from all. Platforms rarely simplify only a single system; more often they merge and simplify multiple systems into a new system: the platform. That often involves additional layers of authentication that control access to the involved resources. In this way, platforms combine related systems into a new system with a single interface. This combined interface is simpler than the sum of the individual interfaces and reduces the complexity on the consumer side. In specific cases – especially relating to authentication and authorization – another value arises: platforms enable compliance, which may otherwise be very hard or not possible at all.

Another huge value is that use of the underlying complex systems is aligned. Consider cloud infrastructures like AWS or GCP: You have a multitude of different solution pathways, a huge variety of possible architectures and designs. You can run your code in lambda functions, container runtimes, virtual machines or on bare metal – just to consider a single use-case. A platform standardizes use of resources. This strongly reduces complexity for those that maintain the lower levels and often allows you to decrease cost by buying in bulk. It also helps to keep technical debt at bay by reducing the surface (=variety) for entropy to wreak havoc.

Related to that is that platforms provide a consistent, well-defined interface. The technological landscape is continuously changing and improving. Having the ability to replace basically all the systems that underlie the platform, while keeping the surface interface the same is invaluable. Why? Because all of the things that use the platform do not need to change! So think thrice before committing the cardinal sin of not separating concerns and exposing implementation details. You will pay for it down the line – promise.

If you forget everything else and just remember one thing, then make it this: Platforms allow developers to focus on developing, by reducing the number of things they need to know to do that.

What are the risks of platforms?

When talking about platforms, it is easy to dote on the benefits and forget about the drawbacks. There are some, and it is important to recognize that not every situation calls for a platform. For example, when software is in the early stages and is still changing heavily and at high frequency, then defining a platform that serves the fast evolving needs of developing the software is not feasible. Similarly, small organizations with only a few developers are better off using pre-existing platforms than investing into building their own. Also, if the software stack is too simple or too diverse, a platform does not add any value.

If the circumstances are right in principle, there is still a fundamental risk in creating a platform: over-optimization / over-fitting. Any platform naturally imposes constraints by being fitted for a set of specific use-cases. This takes away choices, which allows developers to focus on the primary task of development (instead of peripheral tasks like deploying). The risk here is that you optimize the platform in the wrong direction. As a result you may have made the wrong things easier and the wrong things harder in the process. Consider Google Glass, which came out in 2013: the device was purely optimized for the tech side of an augmented reality experience, without considering privacy concerns and social tolerance, resulting in a major failure. By the same token, a development / release platform that optimizes heavily for release while making it harder to check in code changes is a certain recipe for disaster. To avoid this pitfall, it’s important to integrate feedback loops and roll out iterations gradually rather than going for a big-bang approach.

Another risk of over-optimization is locking yourself too deeply into a solution that stalls innovation. Yesterday’s great platform can quickly become today’s obstacle. Again, to avoid this the design of your platform should consider changes to the developer-facing interface from the start. Providing a versioned interface makes it possible to describe migration from one version to another. Detaching the interface design itself from the implementation makes it possible to replace the whole implementation.
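
Here is a rough Python sketch of what a versioned interface with a described migration path could look like. It is only an illustration under assumptions of my own: the request shape, the field names (version, image, tag, replicas) and the functions migrate_v1_to_v2, upgrade and deploy are all hypothetical. The sketch assumes that version 1 of a deploy request carried a single "name:tag" string, while version 2 splits it and adds a replica count.

```python
from typing import Any, Callable

Request = dict[str, Any]

CURRENT_VERSION = 2


def migrate_v1_to_v2(req: Request) -> Request:
    # Assumption: v1 images look like "name:tag" (no registry port in the name).
    image, _, tag = req["image"].partition(":")
    return {"version": 2, "image": image, "tag": tag or "latest", "replicas": 1}


# One migration per version step; a future v3 would add an entry for key 2.
MIGRATIONS: dict[int, Callable[[Request], Request]] = {1: migrate_v1_to_v2}


def upgrade(req: Request) -> Request:
    """Apply migrations step by step until the request is at CURRENT_VERSION."""
    version = req.get("version", 1)  # requests without a version count as v1
    while version < CURRENT_VERSION:
        req = MIGRATIONS[version](req)
        version = req["version"]
    return req


def deploy(req: Request) -> str:
    """The implementation behind the interface only ever sees the current version."""
    req = upgrade(req)
    return f"deploying {req['image']}:{req['tag']} x{req['replicas']}"


if __name__ == "__main__":
    print(deploy({"version": 1, "image": "shop/web:1.4"}))
    print(deploy({"version": 2, "image": "shop/web", "tag": "1.5", "replicas": 3}))
```

The design choice here is that old clients keep working unchanged while the implementation behind deploy() deals with a single request shape, which is what keeps it replaceable.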

Platform Engineering ❤️ DevOps?

All of the above is about platforms and platform engineering. Now let’s have a look at how all of that intersects with DevOps.

First, given that DevOps is perpetually lacking a commonly agreed upon definition, let me clarify that I mean the practices that arise from the three principles that are outlined in the DevOps Handbook by Gene Kim, Jez Humble et al.:

Optimized Flow, throughout all processes by identifying the path of flow, where the bottlenecks are and where work-in-progress accumulates.
Feedback Loops, that act automatically on the signals of pervasive telemetry that is deployed throughout the system.
Continual Learning, that seeks to perpetually improve the processes and thereby the organization.

I certainly don’t want to get into a discussion of whether this is the best definition of DevOps or not. I acknowledge that there are other often cited principles. Thus far, I haven’t encountered a sensible principle that I could not derive from the ones mentioned above. Take, for instance, end-to-end responsibility, which, in my view, exemplifies a practice that integrates all three principles at team-level. It requires (and at the same time fosters) a DevOps mindset in the whole organization in which such teams are embedded.

Anyhow, if you follow the first principle and identify how work flows through your organization across product and process boundaries, then it is only opportune to draw circles around common and especially effortful areas and put them behind platforms. The same goes for ubiquitous work that repeats in many or all steps (again, authentication / authorization is a great example). All of that would fit neatly behind the transparent interface of a platform. Maybe your setup is even simple enough that you can draw a single circle around it and have one super platform that reduces the efforts of the whole process.

The second principle of feedback is equally supported through platforms, because for scalability and reliability reasons, you will have pervasive telemetry already deeply integrated into your platform. Exposing those metrics to each platform user as a signal for their feedback loops (of their specific products) is likely little effort.
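
As a small illustration of how little effort that can be, consider this Python sketch. It assumes the platform already records samples tagged with a tenant; the Telemetry class, its methods and the metric name deploy_seconds are invented for the example and not taken from any real monitoring library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    tenant: str
    metric: str
    value: float


class Telemetry:
    """Hypothetical platform-wide telemetry store."""

    def __init__(self) -> None:
        self._samples: list[Sample] = []

    def record(self, tenant: str, metric: str, value: float) -> None:
        self._samples.append(Sample(tenant, metric, value))

    def for_tenant(self, tenant: str) -> dict[str, float]:
        """Average of each metric, restricted to one tenant's samples."""
        grouped: dict[str, list[float]] = defaultdict(list)
        for sample in self._samples:
            if sample.tenant == tenant:
                grouped[sample.metric].append(sample.value)
        return {metric: sum(vals) / len(vals) for metric, vals in grouped.items()}


if __name__ == "__main__":
    telemetry = Telemetry()
    telemetry.record("shop", "deploy_seconds", 40)
    telemetry.record("shop", "deploy_seconds", 60)
    telemetry.record("search", "deploy_seconds", 300)
    # Each team only sees the signal for its own product.
    print(telemetry.for_tenant("shop"))
```

The tenant filter is the same isolation boundary the platform needs anyway for security and accounting, so the per-product signal comes almost for free.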

Continual Learning may seem a bit harder to argue at first glance, because platforms hide complexity and thereby reduce the opportunities for learning for individuals. However, DevOps is about organizations and Platforms certainly supplement organizational learning: they provide a high level signal for a whole category of processes (i.e. all products that use it in their processes) and they provide a focal point to improve that whole category as well. To an extent platforms are also an implementation of organizational knowledge.

So, yes! Platform engineering fits nicely into a DevOps-adhering organization – even if it developed independently.

Conclusion

I hope I have made clear that platform engineering is not tightly coupled with DevOps – even if it is a great partner to it. Again, I fully endorse the pitch – and I also want to prevent an unnecessary boxing-in of thinking. To me the platform model has a huge range of applications. I can only see this range growing further in the future – especially since increased tech complexity in all kinds of organizations has become an unstoppable trend.

I am writing another article about the end-to-end responsibility practice I mentioned above. My interest here is how platforms enable that practice even at the largest scale. More in my next article.

Ulrich Kautz Blog