
Part 2 of our series “Build or Buy Kubernetes”
After exploring in the first part why the decision to adopt Kubernetes goes far beyond choosing a container orchestration platform, the next question inevitably arises: what does it actually cost to operate a Kubernetes platform?
Most companies start this consideration with the wrong numbers.
They compare prices for virtual machines, managed Kubernetes offerings, or cloud instances. Their calculations include the costs of the control plane, worker nodes, storage, and outgoing data traffic. These items are easy to measure and appear on every monthly bill from the cloud provider.
Precisely for this reason, they convey a deceptive sense of security.
From a business perspective, infrastructure costs are among the easiest components of a Kubernetes platform to calculate. They follow known scaling models, can be budgeted, and can be continuously optimized with appropriate tools.
The real costs arise elsewhere: at the point where technological complexity translates into organizational effort.
A production Kubernetes cluster incurs costs for compute, network, and storage. Beyond a certain scale, load balancers, container registries, observability stacks, and backup infrastructure are added. All these components can be quantified and included in a total cost of ownership analysis.
It is significantly harder to assess the resources needed to operate this platform reliably over the long term.
A highly available cluster requires not just administrators, but engineers with deep knowledge of Linux, networking, storage systems, public key infrastructure, observability, identity and access management, and modern software deployment. Added to this are regulatory requirements, security processes, incident management, and continuous training.
These competencies cannot be scaled arbitrarily.
They develop over years of practical experience.
While an additional worker node can be provisioned within minutes, building an experienced platform team often takes several years. This is precisely why qualified personnel are by far the largest investment in platform operations.
The most expensive resource of a Kubernetes platform is rarely the cluster itself. It is the people who ensure its stability.
Interestingly, introducing Kubernetes does not necessarily increase infrastructure costs. They often decrease thanks to better resource utilization, greater automation, and standardized deployments.
What changes, however, is the distribution of costs.
Instead of classic infrastructure investments, there are continuous expenses for platform development, governance, and operational processes. Kubernetes shifts investments from hardware to knowledge.
This shift is overlooked in many economic analyses.
A company that decides to operate its own platform is not only investing in servers or cloud resources. It is investing in the long-term development of an organization whose task is to continuously evolve an internal platform.
With each additional application, the demand for standardization, documentation, and automation grows. A technical operational environment gradually becomes an internal product, whose users are the company’s own development teams.
This changes not only the technical requirements but also the economic metrics.
In software architecture, complexity is often understood as a technical problem.
In platform operations, however, it is primarily an organizational problem.
Every new component not only expands the technical architecture but also increases the number of possible dependencies, operational states, and error scenarios. With each additional interface, the effort for documentation, test automation, monitoring, incident response, and change management grows.
These connections are rarely immediately visible.
They manifest, for example, in longer coordination processes between development and operations, increasing documentation requirements, more complex release processes, or a growing number of operational exceptions.
The platform does not necessarily become unstable as a result. However, it becomes increasingly difficult to understand.
This is where the concept of cognitive load comes in, popularized in particular by the book Team Topologies.
Every development team has only a limited ability to simultaneously understand, operate, and further develop complex systems.
This cognitive load is not a theoretical quantity but one of the most important factors influencing productivity, error rates, and the speed of innovation.
When software developers, in addition to their actual domain, must also master network architectures, Kubernetes internals, storage concepts, service meshes, certificate management, policy engines, and security processes, the complexity of their daily work increases significantly.
The result is not necessarily worse software.
The result is slower software development.
Every hour a development team spends analyzing a faulty NetworkPolicy is an hour no longer available for developing the actual product.
This creates opportunity costs that are not reflected in any cloud bill.
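To make this kind of opportunity cost tangible, a rough estimate can be sketched in a few lines of Python. The names `PlatformToil` and `annual_opportunity_cost` and all input values are hypothetical. They only illustrate the arithmetic and are not figures from a real team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformToil:
    """Time a product team spends on platform issues instead of product work."""

    engineers: int                  # people on the team (hypothetical input)
    hours_per_engineer_week: float  # weekly hours lost to platform troubleshooting
    loaded_hourly_rate: float       # assumed fully loaded cost per engineering hour
    weeks_per_year: int = 46        # assumed working weeks; adjust to your calendar


def annual_opportunity_cost(toil: PlatformToil) -> float:
    """Return the yearly cost of engineering hours diverted from product work."""
    hours = toil.engineers * toil.hours_per_engineer_week * toil.weeks_per_year
    return hours * toil.loaded_hourly_rate


# Placeholder values for illustration only.
example = PlatformToil(engineers=8, hours_per_engineer_week=3.0, loaded_hourly_rate=90.0)
print(f"Estimated annual opportunity cost: {annual_opportunity_cost(example):,.0f}")
```

Even this simple model only captures the direct cost of the hours. It does not capture delayed releases or features that were never built, so it is better read as a lower bound than as a precise number.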
Another aspect is often overlooked in the build-or-buy discussion.
According to Conway’s Law, software architectures mirror the communication structures of the organization that develops them.
This connection applies particularly to Kubernetes.
Companies that have multiple development teams working on a common platform inevitably need organizational interfaces. Platform teams define standards. Development teams consume these standards. Security departments formulate guidelines. Compliance officers establish audit processes.
The platform thus becomes the common product of various organizational units.
The larger this organization becomes, the more important clear responsibilities, standardized processes, and consistent governance become.
The introduction of Kubernetes thus changes not only a company's technical architecture. It changes how the company collaborates.
At this point, the question inevitably arises whether every company should build its own platform team.
The answer is: not necessarily.
Organizations whose competitive advantage directly arises from their platform competence often benefit from a dedicated platform team. Hyperscalers, SaaS platforms, or companies with highly standardized development processes can achieve significant economies of scale when platform development becomes a core competency.
For many medium-sized software companies, however, the situation is different.
Their economic success does not depend on whether they perform particularly efficient Kubernetes upgrades or develop admission controllers.
It depends on how quickly they can deliver high-quality software to their customers.
Such organizations should critically examine which tasks are true strategic differentiators and which can be standardized or handed over to specialized platform operators.
Building your own platform team does not just mean creating additional positions. It means taking on permanent responsibility for architecture, operations, governance, documentation, security processes, and continuous development.
This decision should be made as carefully as the development of your own software product.
A reliable economic analysis must therefore go significantly further than comparing different infrastructure offerings.
It should answer the following questions, among others:
Only when these factors are considered together does a realistic picture of the actual costs of a Kubernetes platform emerge.
It often becomes apparent that infrastructure accounts for only a relatively small part of the total investment.
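How such a breakdown might be computed can be sketched as follows. The function `cost_shares`, the cost categories, and every amount are assumptions made up for illustration. They are placeholders to be replaced with your own figures, not measurements or benchmarks.

```python
def cost_shares(annual_costs: dict[str, float]) -> dict[str, float]:
    """Return each cost category as a fraction of the total annual cost."""
    total = sum(annual_costs.values())
    if total <= 0:
        raise ValueError("total annual cost must be positive")
    return {name: amount / total for name, amount in annual_costs.items()}


# Hypothetical placeholder amounts; substitute your own numbers.
annual_costs = {
    "infrastructure": 120_000.0,          # compute, network, storage, managed services
    "platform_team": 450_000.0,           # salaries and on-call for platform engineers
    "training_and_governance": 30_000.0,  # training, audits, documentation
}

shares = cost_shares(annual_costs)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The value of such a model lies less in the exact percentages than in forcing personnel, training, and governance onto the same sheet as the cloud bill.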
Kubernetes is neither cheap nor expensive.
It merely makes visible the organizational investments necessary to reliably operate modern software platforms.
Those who evaluate self-operation solely based on infrastructure costs significantly underestimate the actual effort. What matters are not the costs for virtual machines or managed services, but the long-term investments in people, processes, and organizational maturity.
The actual build-or-buy decision is therefore not a technical one, but an economic and strategic one: which competencies create a sustainable competitive advantage, and which should be deliberately standardized so that valuable engineering capacity is deployed where it benefits the company most?
In the third and final part of this series, we will examine the different operational models in detail. We will analyze why “Managed Kubernetes” often only means a managed Control Plane, what responsibilities still remain with the company, and why the most successful platform strategies in the long run do not rely on maximum dependency, but on transparency, open standards, and systematic knowledge transfer.