
Kubernetes is an open-source platform for orchestrating containerized applications. It automates deployment, scaling, and operations, forming the foundation of modern cloud-native infrastructures. At its core, Kubernetes is based on a declarative model: the desired state is described, and controllers continuously ensure that this state is achieved and maintained.
With version 1.36, a topic comes into focus that has long been underestimated but deeply impacts the reliability of Kubernetes: Staleness in Controllers. The innovations may seem technically unspectacular at first glance, but conceptually they are a significant step towards more robust and comprehensible systems.
To understand why these changes are relevant, one must briefly consider how controllers work.
A controller observes the state of resources through so-called informers. These, in turn, maintain a local cache that is updated via watch events from the API server. This design is deliberately chosen: it reduces load on the API server and enables quick reactions.
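The cache-update mechanism can be pictured with a small self-contained sketch. This is only an approximation of what an informer does: real informers in client-go work with typed objects, a shared indexer, and a watch connection to the API server, none of which appear here.

```go
package main

import "fmt"

// event is a minimal stand-in for a watch event from the API server.
type event struct {
	kind string // "ADDED", "MODIFIED", "DELETED"
	key  string
	rv   string // resource version carried by the event
}

// applyEvents updates a local cache from a stream of watch events, the
// way an informer keeps its local store current.
func applyEvents(cache map[string]string, events []event) {
	for _, e := range events {
		switch e.kind {
		case "ADDED", "MODIFIED":
			cache[e.key] = e.rv
		case "DELETED":
			delete(cache, e.key)
		}
	}
}

func main() {
	cache := map[string]string{}
	applyEvents(cache, []event{
		{"ADDED", "default/web", "100"},
		{"MODIFIED", "default/web", "101"},
		{"ADDED", "default/db", "102"},
		{"DELETED", "default/db", "103"},
	})
	fmt.Println(len(cache), cache["default/web"]) // 1 101
}
```

The important property is that the cache is only ever as current as the last event it has seen, which is exactly where staleness enters the picture.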
The price for this is obvious—and has long been accepted:
The controller does not operate on reality but on a possibly outdated representation of that reality.
In many cases, this is unproblematic. After all, Kubernetes is designed for eventual consistency. But in highly dynamic environments, this is where the problem begins.
For instance, if a controller has just made a change—such as scaling or replacing pods—it may happen that its own cache does not yet reflect this change. The controller “doesn’t know” that it has just acted.
This leads to subtle but real problems: decisions are made twice, and necessary actions are omitted or occur too late. This is particularly critical for resources with high change rates, such as pods.
The insidious thing about staleness is not only its existence but its invisibility in the normal development process.
In test environments, clusters are usually small, latencies are low, and event flows are manageable. Under these conditions, controllers seem to function correctly. It is only in production setups, with large clusters, high change rates, and fluctuating latencies, that it becomes apparent that implicit assumptions no longer hold.
Typical symptoms are hard to reproduce: a controller “sometimes reacts incorrectly,” “occasionally scales too late,” or behaves “inconsistently under load.” Until now, without deeper observability, it was almost impossible to analyze these effects cleanly.
The innovations in Kubernetes 1.36 address this very issue. Instead of attempting to change the underlying consistency model—which would be hardly feasible in a distributed system—a pragmatic approach is pursued:
Controllers should be able to recognize when their worldview is outdated—and consciously not act in such cases.
This is an important paradigm shift. Previously, the implicit rule was: Receive event → Execute reconcile. Now it is: Receive event → Check consistency → Possibly suspend reconcile.
At the heart of this logic is the use of Resource Versions.
Every change to a Kubernetes object generates a new resource version. This can be understood as a kind of monotonic timeline. Kubernetes 1.36 deliberately exploits this property.
Through extensions in client-go, controllers can now record the resource version of their own writes and compare it with the version their informer cache has reached.
The crucial question is:
Does my cache have at least the state that I last wrote myself?
If the answer is “no,” the controller is working with outdated data—and should not make further decisions.
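The check behind this question can be sketched as follows. All type and method names here are illustrative, not part of the real client-go API, and resource versions are parsed as integers purely for the sketch: in production code, resource versions are opaque strings that should not be interpreted numerically.

```go
package main

import (
	"fmt"
	"strconv"
)

// Controller keeps the resource version of its own last write so it can
// detect when the informer cache has not yet caught up.
type Controller struct {
	lastWrittenRV uint64 // resource version returned by our last update
}

// parseRV converts a resource version string to a number purely for
// this sketch; real resource versions are opaque.
func parseRV(rv string) uint64 {
	n, _ := strconv.ParseUint(rv, 10, 64)
	return n
}

// RecordWrite is called after a successful write with the resource
// version the API server returned.
func (c *Controller) RecordWrite(rv string) {
	c.lastWrittenRV = parseRV(rv)
}

// CacheIsFresh answers the key question: does the cache reflect at
// least the state we last wrote ourselves?
func (c *Controller) CacheIsFresh(cacheRV string) bool {
	return parseRV(cacheRV) >= c.lastWrittenRV
}

func main() {
	c := &Controller{}
	c.RecordWrite("105") // our update produced resource version 105

	fmt.Println(c.CacheIsFresh("103")) // false: cache is stale, skip reconcile
	fmt.Println(c.CacheIsFresh("105")) // true: cache caught up, safe to act
}
```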
An often overlooked aspect is that inconsistency does not only arise during reading but already during event processing.
Before Kubernetes 1.36, events were processed in the order they arrived. This sounds logical but is problematic in distributed systems because order cannot be guaranteed. Especially during the initial cache build-up (List + Watch), states could arise that never existed in the cluster.
With the introduction of Atomic FIFO, this problem is addressed. Events—especially from initial list operations—are processed atomically. This ensures that the cache represents a consistent state at all times.
This change works in the background but is crucial: without a consistent cache, any subsequent staleness detection is worthless.
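The idea of atomic application can be illustrated with a self-contained sketch: the initial list is installed under a single lock, so a reader either sees the state before the list or the complete state after it, never a half-applied mixture. The real informer machinery is considerably more involved.

```go
package main

import (
	"fmt"
	"sync"
)

// cache applies an initial list of objects as one atomic batch.
type cache struct {
	mu    sync.RWMutex
	items map[string]string // object key -> resource version
}

// Replace installs the whole initial list under a single lock, so no
// reader can ever observe a partially applied list.
func (c *cache) Replace(objs map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = make(map[string]string, len(objs))
	for k, v := range objs {
		c.items[k] = v
	}
}

// Len is a reader that always sees a consistent snapshot.
func (c *cache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.items)
}

func main() {
	c := &cache{}
	c.Replace(map[string]string{"pod-a": "101", "pod-b": "102"})
	fmt.Println(c.Len()) // 2
}
```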
The real strength of the innovation is shown in its application in the kube-controller-manager.
Controllers like the ReplicaSet, StatefulSet, or Job controller now check before each reconciliation whether their cache is current enough. If it is not, the cycle is skipped.
This means concretely: The controller actively waits for its own worldview to “catch up” before acting again.
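A skipped cycle typically does not mean the work is dropped, but that the key is requeued for a later attempt. The sketch below shows that shape; the function name, the string results, and the requeue callback are all illustrative, not the actual kube-controller-manager code.

```go
package main

import "fmt"

// reconcile skips the cycle when the cache lags behind the controller's
// own last write, and requeues the key instead of acting on stale data.
func reconcile(key string, cacheRV, lastWrittenRV uint64, requeue func(string)) string {
	if cacheRV < lastWrittenRV {
		requeue(key) // try again once the cache has caught up
		return "skipped"
	}
	// ... normal reconciliation logic would run here ...
	return "reconciled"
}

func main() {
	queue := []string{}
	requeue := func(k string) { queue = append(queue, k) }

	fmt.Println(reconcile("default/web", 103, 105, requeue)) // skipped: stale cache
	fmt.Println(reconcile("default/web", 105, 105, requeue)) // reconciled: caught up
	fmt.Println("requeued:", len(queue))
}
```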
This behavior corresponds to a concept that is self-evident in many other systems but has been missing in Kubernetes until now:
Read your own writes
The ability to reliably see one’s own changes before making further decisions.
For developers of custom controllers, Kubernetes 1.36 provides the ConsistencyStore tool to implement this logic themselves.
The principle is deliberately kept simple: The controller remembers which resource version it last wrote and checks before each further action whether the cache has already reached this state.
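That principle can be approximated per object key, roughly as the text describes for the ConsistencyStore. The sketch below is a self-contained stand-in, not the real client-go type, and it again treats resource versions as integers only for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// store tracks, per object key, the resource version of the
// controller's last own write.
type store struct {
	mu      sync.Mutex
	written map[string]uint64
}

func newStore() *store { return &store{written: make(map[string]uint64)} }

func rv(s string) uint64 { n, _ := strconv.ParseUint(s, 10, 64); return n }

// NoteWrite records the resource version returned by a write to key.
func (s *store) NoteWrite(key, resourceVersion string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v := rv(resourceVersion); v > s.written[key] {
		s.written[key] = v
	}
}

// HasCaughtUp reports whether the cached copy of key is at least as new
// as our last write to it. Keys we never wrote are always caught up.
func (s *store) HasCaughtUp(key, cachedResourceVersion string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return rv(cachedResourceVersion) >= s.written[key]
}

func main() {
	s := newStore()
	s.NoteWrite("default/web", "310")

	fmt.Println(s.HasCaughtUp("default/web", "309")) // false: skip
	fmt.Println(s.HasCaughtUp("default/web", "310")) // true: proceed
	fmt.Println(s.HasCaughtUp("default/db", "100"))  // true: never wrote it
}
```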
Interestingly, Kubernetes does not enforce this behavior but establishes a pattern. The responsibility for correct behavior remains with the developer, but the necessary tools are now available.
Especially in the context of operators and individual automations, this is an important step. Many custom controllers today suffer from exactly the described staleness problems without the authors being aware of it.
In addition to the actual mitigation, the second major innovation is improved observability.
With the new metric stale_sync_skips_total, it is now possible to quantify how often a controller consciously did not act due to an outdated cache. This is more than just a debugging aid—it is an indicator of the “health” of the control loops.
Additionally, informers now output their current resource version as a metric. This makes it possible to visualize how far a cache lags behind the actual cluster state.
In practice, this opens up new possibilities: cache lag can be graphed over time, alerts can fire when a controller skips syncs unusually often, and load tests can reveal how far informers fall behind the cluster. What was previously black-box behavior is now transparent for the first time.
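The semantics of the two signals can be sketched with plain counters. The metric name stale_sync_skips_total comes from the text; the variable names and the trySync helper are illustrative, and a real controller would expose these values through a metrics library such as the Prometheus client rather than printing them.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Minimal stand-ins for the two signals described in the text: a
// counter of consciously skipped syncs and the informer's current
// resource version.
var (
	staleSyncSkips atomic.Uint64 // how often we refused to act on stale data
	informerRV     atomic.Uint64 // resource version the cache has reached
)

// trySync acts only if the cache is fresh enough relative to our last
// write; otherwise it records a skip.
func trySync(lastWrittenRV uint64) bool {
	if informerRV.Load() < lastWrittenRV {
		staleSyncSkips.Add(1)
		return false
	}
	return true
}

func main() {
	informerRV.Store(100)

	trySync(102) // cache lags our write: skipped and counted
	trySync(100) // fresh enough: acted

	fmt.Println("stale_sync_skips_total:", staleSyncSkips.Load())
	fmt.Println("informer resource version:", informerRV.Load())
}
```

A dashboard would plot the skip counter's rate and the gap between the informer's resource version and the cluster's current one.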
Kubernetes remains a system that relies on eventual consistency—and for good reason. Strong consistency would massively limit scalability.
The innovations in v1.36 do not attempt to replace this model. Instead, they create a mechanism to deal more consciously with inconsistency within this model.
The result is not a fully deterministic system, but a significantly better controllable one.
Especially in scenarios with large clusters, high change rates, and many controllers acting in parallel, the difference becomes noticeable.
A particularly important point is the planned transfer of these mechanisms into controller-runtime. This would mean that all operators based on it would automatically benefit from staleness mitigation.
This would be a significant step towards standardization: No longer does each controller implement its own—often error-prone—logic, but consistency becomes part of the framework.
Kubernetes v1.36 does not deliver spectacular new APIs but addresses a fundamental problem in the operation of distributed systems: making decisions based on outdated information.
The introduced mechanisms ensure that controllers can better assess their own behavior. They no longer act blindly on events but consider the state of their own perception.
For operators, this means more stable systems. For developers, it means clearer guidelines. And for Kubernetes as a whole, it is a step towards greater maturity—where it really counts: in behavior under real conditions.