
In the world of critical infrastructures (KRITIS), the success of a disaster recovery concept is often measured by hard metrics like the RTO (Recovery Time Objective). However, there is a “soft” metric that determines acceptance or chaos in practice: The user experience at the moment of switchover.
Imagine an operator in a network control center coordinating a critical switching operation via a web interface. In the background, a data center fails, and traffic shifts to the backup location within seconds. If the operator suddenly lands on a login page and loses their session, the technical failover may have succeeded, but the operational process is dangerously interrupted.
True business continuity means that the user session survives the location change.
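By default, many applications store session data locally, either in the memory of the respective server or in a local cache. If that server's data center fails, the session is lost with it, and the backup location has no record of the logged-in user.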
To prevent this scenario, we decouple session management from the local application instance. We use a distributed in-memory store (usually Redis), which acts as a global “source of truth” for identities.
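As a rough illustration of this decoupling, the following sketch lets two application instances share one session store. The names `KeyValueStore`, `InMemoryStore`, and `AppInstance` are hypothetical and not taken from any real setup; the in-memory class merely stands in for a replicated Redis deployment so that the example runs without a server.

```python
import json
import secrets
import time
from typing import Dict, Optional, Protocol, Tuple


class KeyValueStore(Protocol):
    """Hypothetical minimal interface; a Redis client would sit behind it."""

    def set(self, key: str, value: str, ttl_seconds: int) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore:
    """Stand-in for the shared store so the sketch runs without Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: int) -> None:
        # Remember the value together with its expiry time.
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired sessions are dropped on read
            return None
        return value


class AppInstance:
    """An application server that keeps no session state of its own."""

    def __init__(self, region: str, store: KeyValueStore) -> None:
        self.region = region
        self.store = store

    def login(self, user: str, ttl_seconds: int = 1800) -> str:
        session_id = secrets.token_urlsafe(32)
        payload = json.dumps({"user": user})
        self.store.set(f"session:{session_id}", payload, ttl_seconds)
        return session_id

    def current_user(self, session_id: str) -> Optional[str]:
        raw = self.store.get(f"session:{session_id}")
        return json.loads(raw)["user"] if raw is not None else None


shared_store = InMemoryStore()
region_a = AppInstance("region-a", shared_store)
region_b = AppInstance("region-b", shared_store)

sid = region_a.login("operator-1")
# After a failover, region B resolves the same session without a new login.
print(region_b.current_user(sid))  # prints: operator-1
```

Because neither instance holds the session itself, it does not matter which location answers the next request, as long as the store behind the interface is reachable there.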
In addition to server-side replication, we use modern standards like JSON Web Tokens (JWT). Because these tokens are cryptographically signed and carry the necessary user information themselves, each location that holds the verification key can check the validity of a session independently, even if the database connection between regions is interrupted for a few seconds.
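To make the idea of local verification concrete, here is a minimal sketch that signs and checks a token using only the Python standard library. It assumes HS256 with a shared key purely to stay self-contained; the function names and the example key are hypothetical. A production setup would more likely rely on a maintained JWT library and often on asymmetric signatures, so that each region only needs a public key.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, key: bytes) -> bytes:
    return hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(claims: dict, key: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url_encode(_sign(signing_input, key))}"


def verify_token(token: str, key: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        expected = _sign(f"{header_b64}.{payload_b64}", key)
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64, or invalid JSON
        return None
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or current >= exp:
        return None  # missing or expired "exp" claim
    return claims


key = b"example-key-for-illustration-only"  # assumption: distributed to both regions
token = issue_token({"sub": "operator-1", "exp": int(time.time()) + 900}, key)

# Region B checks the token locally, without any cross-region lookup.
claims = verify_token(token, key)
print(claims["sub"] if claims else "login required")
```

The check needs nothing but the key and the local clock, which is why it keeps working while the link between regions is down. The trade-off is that a signed token stays valid until it expires, so short lifetimes are a common choice.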
This significantly increases resilience: The user remains “in,” even if the infrastructure is working hard in the background to reorganize itself.
An invisible failover is the highest quality feature of a KRITIS platform. By ensuring that sessions and states are georedundantly available, we protect not only the IT systems but also the work processes of the people who operate them. The location change turns from a potential crisis into a mere background event.
Does session replication lead to high network load between locations? No. Session data is typically very small (a few kilobytes). Even with thousands of simultaneous users, the bandwidth required for replication is negligible compared to database or video streams.
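A back-of-the-envelope calculation shows the order of magnitude. The figures below (5,000 users, 4 KiB per session, one session write per user per minute) are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def replication_bandwidth_mbit_s(
    users: int,
    session_bytes: int,
    writes_per_user_per_minute: float,
) -> float:
    """Average cross-region replication traffic in Mbit/s (payload only)."""
    bytes_per_second = users * session_bytes * writes_per_user_per_minute / 60
    return bytes_per_second * 8 / 1_000_000


# Assumed workload, for illustration only.
estimate = replication_bandwidth_mbit_s(
    users=5_000,
    session_bytes=4_096,
    writes_per_user_per_minute=1,
)
print(f"{estimate:.2f} Mbit/s")  # prints: 2.73 Mbit/s
```

Protocol overhead and bursts come on top of this, but even a generous safety factor leaves the result far below what database replication or video streams typically require.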
What happens if replication lags by a second? In extremely rare cases (a race condition), a user's request might reach Region B in the brief window before their session has arrived there. Here, “graceful degradation” mechanisms kick in: The application retries the session lookup a few times with a short delay before prompting the user to log in.
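Such a retry could look roughly like the following sketch. The helper `load_session_with_retry`, its default values, and the `LaggingReplica` class that simulates delayed replication are all hypothetical; the right number of attempts and the delay depend on the replication lag actually observed.

```python
import time
from typing import Callable, Optional


def load_session_with_retry(
    lookup: Callable[[str], Optional[dict]],
    session_id: str,
    attempts: int = 3,
    delay_seconds: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Try the lookup a few times before giving up on the session."""
    for attempt in range(attempts):
        session = lookup(session_id)
        if session is not None:
            return session
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give replication a moment to catch up
    return None  # the caller falls back to the login prompt


class LaggingReplica:
    """Simulates a replica that receives the session only after some misses."""

    def __init__(self, misses_before_arrival: int) -> None:
        self.remaining_misses = misses_before_arrival

    def lookup(self, session_id: str) -> Optional[dict]:
        if self.remaining_misses > 0:
            self.remaining_misses -= 1
            return None
        return {"user": "operator-1"}


replica = LaggingReplica(misses_before_arrival=2)
session = load_session_with_retry(replica.lookup, "abc123", delay_seconds=0.01)
print("session restored" if session else "login required")  # prints: session restored
```

Keeping the total wait short matters here: the retry should bridge a replication delay, not hide a real outage from the user.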
Does my application need to be specially programmed for this scenario? Yes, the application must not store state locally in the file system or RAM. Applications built this way are called “stateless”: they offload their state to external services like Redis. This is a cornerstone of modern cloud-native architecture.
How secure is session data during transmission? Replication between Redis instances occurs over encrypted tunnels (e.g., via Cilium Cluster Mesh or TLS), ensuring that session information never flows in plaintext over the wide area network.
How does ayedo support the implementation of session persistence? We analyze your application architecture, implement highly available Redis clusters in your regions, and configure the necessary replication pipelines. We ensure that your failover not only works technically but also feels seamless to your users.