Multi-Region Kubernetes for Critical Infrastructure: How ayedo Transformed a Single-Site Platform into an Active/Active Anycast System


Operating critical infrastructure requires more than just “good availability”—it demands demonstrable resilience against failures. This need extends beyond a single data center to encompass multiple locations. Many platforms, although technically sound, fail in this regard because they have historically evolved within a single region: redundant in the rack, redundant in the cluster, redundant in the database—yet still a single point of failure at the site level.

In this post, we illustrate through an anonymized customer project how ayedo elevated a critical infrastructure control and monitoring platform from a single data center to a multi-region architecture—with active/active Kubernetes clusters, BGP-based Anycast routing, Bring-Your-Own-IP, and automated failover that demonstrably achieves an RTO of under 30 seconds in tests.

The customer remains anonymous. The approach is reproducible—especially for organizations that must not only document regulatory business continuity but also prove it technically.


Initial Situation: Highly Available—But Only Within One Location

The customer operates a central platform for operators of electricity, gas, and heating networks. This platform collects network status data, coordinates switching actions, and automates regulatory reporting. The user base is accordingly sensitive: several dozen network operators rely on these systems for 24/7 operations. The environment is critical infrastructure, with requirements from the BSI Act, NIS-2, and the IT Security Catalog of the Federal Network Agency.

Technically, the platform was implemented in Frankfurt on a Kubernetes cluster, classically redundant within the location: multiple control plane nodes, multiple workers, replicated databases, load balancing. For the first few years, this was sufficient. The team managed typical outages, rolling updates were established, and node failures were manageable.

The problem was not with the cluster, but with the location.

As long as “Frankfurt available” is an assumption, such a setup seems solid. Once you seriously question what happens in the event of a site failure—fire, power, network, physical incident, widespread disruption—the architecture collapses. Redundancy within a location does not help against the failure of the location itself.

This question became existential as the product’s importance grew. An audit criticized the lack of geo-redundancy as a major deficiency and set a deadline for remediation. At the same time, customer requirements became more specific: multi-location as a minimum requirement, demonstrable disaster recovery, reliable recovery times.


Why “Manual Disaster Recovery” is Unacceptable in Critical Infrastructure

There was a disaster recovery plan—but it was fundamentally manual: restore backups, switch DNS, start services, check dependencies, inform customers. The estimated recovery time was several hours. In many business systems, this would be painful but tolerable. In a platform that supports real-time functions for network operators and is subject to regulatory oversight, this is unacceptable.

This is not only about the raw downtime but also the knock-on effects. Network operators often have fixed firewall rules, dedicated VPN tunnels, and strictly defined IP-based access controls. If the platform IP changes, that is not "a change" but a coordinated project across many organizations, with approvals, maintenance windows, and sign-offs. In reality, such a process takes weeks, not hours.

The customer thus faced a second, often underestimated problem: even if a technical failover could be executed quickly, network access on the customer side would be a bottleneck. This is why classic “switch DNS” DR concepts regularly fail in such environments: DNS is not the problem; the network reality of the customers is.

At the same time, latency issues emerged for more distant users. SCADA integrations and visualizations with sub-second requirements react sensitively to additional milliseconds. A central location creates noticeable disadvantages for certain regions—and the pressure is increasing not only to be fail-safe but also closer to the customer.

Finally, maintenance windows became a risk factor. Cluster upgrades, OS patches, and infrastructure work in a single region always mean operating temporarily with reduced redundancy. Even if technically "nothing should fail," a residual risk remains that customers in the critical infrastructure context are no longer willing to accept. The expectation was clear: maintenance must not endanger operations; it must be architecturally decoupled.


The Turning Point: Customer Demand + Audit Deadline + Measurable RTO

The moment when “we should” became “we must now” came from two factors simultaneously: the audit deadline and a major network operator who explicitly tied a contract renewal to geo-redundant architecture and automated failover—with an RTO of under 60 seconds.

Thus, the target definition was no longer debatable. It was not about “a second data center as a backup,” but about demonstrable, automatic takeover times in the seconds range. And it was not about a one-time setup, but about an architecture that can be maintained without maintenance windows and audited without paperwork acrobatics.
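What "demonstrable takeover times in the seconds range" means in practice is a measurement, not a claim: a failover drill probes the endpoint continuously and reports the worst contiguous downtime window as the observed RTO. A minimal sketch of that computation (the probe interval and timings are invented for illustration, not the customer's actual drill data):

```python
from dataclasses import dataclass

@dataclass
class Probe:
    t: float      # seconds since drill start
    ok: bool      # True if the endpoint answered

def observed_rto(probes: list[Probe]) -> float:
    """Longest contiguous window in which the endpoint was down.

    Measured from the last good probe before an outage to the
    first good probe after it - the value a failover drill reports.
    """
    worst = 0.0
    down_since = None
    last_ok = None
    for p in sorted(probes, key=lambda p: p.t):
        if p.ok:
            if down_since is not None:
                worst = max(worst, p.t - down_since)
                down_since = None
            last_ok = p.t
        else:
            if down_since is None:
                down_since = last_ok if last_ok is not None else p.t
    return worst

# Simulated drill: a region is killed at t=10, traffic converges by t=32
probes = [Probe(t, not (10 < t < 32)) for t in range(0, 60, 2)]
print(observed_rto(probes))  # → 22
```

Expressing the target this way turns "RTO under 60 seconds" from a contractual sentence into a number every drill regenerates.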

This is where we, as ayedo, stepped in.


ayedo’s Approach: Multi-Region as an Operational Model—Not an Emergency Plan

We approached the task as a platform and network problem, not as “setting up a second cluster.” True geo-redundancy only arises when three things are solved simultaneously:

First: Traffic must be able to switch without DNS changes and without customer intervention. Second: Workloads must run active/active so that failover means “continue running” rather than “start up.” Third: The proof must be systemically generated—through regular tests, measurable, documentable.

The result was an architecture that decouples locations, makes operations maintainable, and declares failover as a function of the network and platform—not a runbook.


Architecture: Three Availability Zones, Two Regions, Active/Active Clusters per Region

Instead of stretching a single Kubernetes cluster across locations, we established dedicated clusters per region. This is a deliberate decision. A “stretched cluster” sounds elegant on paper but brings significant disadvantages in operation: complexity in the control plane, sensitive dependency on interconnects, difficult debugging, and a higher likelihood that a partial problem becomes a cluster problem.

With separate clusters per region, the blast radius can be cleanly limited. An entire region can fail without compromising the other. Both regions are active and process traffic simultaneously. Thus, there is no “cold” site that must prove it works in an emergency. Each region is stressed, visible, and tested in everyday operations.

The locations were connected via redundant, dedicated links with low latency to enable stable replication and central control—without a hard coupling that turns a network problem into a platform problem.


Anycast and BGP: Failover Without DNS and Without Customer Coordination

The central lever for the customer side was Anycast—implemented via BGP. Instead of binding a single IP to one location, the same IP is announced in both regions. For the client, everything remains the same: same IP, same firewall rule, same VPN tunnel. The difference lies only in the route behind it.

If a region fails, BGP automatically withdraws the route. Traffic flows to the remaining region—without DNS changes, without customers having to touch firewalls, without adjusting VPN configurations. In critical environments, this is the difference between “we can technically failover” and “we can truly failover.”

The concept was further extended with Bring-Your-Own-IP because some enterprise customers want to use provider-independent address space. Such requirements are common in highly regulated networks: customers want IP ownership and the ability to announce prefixes via BGP themselves or through a service provider. BYOIP is not a “nice feature” in these projects but often an onboarding criterion.

With the Anycast architecture, a second effect was also achieved: latency. Clients automatically land at the nearest, healthy endpoint. For distant regions, this noticeably reduces round-trip time—especially for interactions composed of many small requests.


Network and Policies: Cilium Cluster Mesh as a Connecting Layer

In multi-cluster setups, the network is not just transport but also security and operational logic. We used Cilium as the CNI and Cluster Mesh to enable service discovery and load balancing across cluster boundaries without fragmenting security policies.

The crucial aspect is not only that workloads can communicate with each other but that network policies are enforced centrally and consistently. In regulated environments, “identical policies in all regions” is both an audit argument and an operational argument: less drift, fewer special cases, less risk.


Data Layer: Replication That Accepts the Reality of Regions

Geo-redundancy often fails due to the illusion that “everything can be synchronous” across regions. In practice, synchronous replication across regions with strict latency requirements and high availability is a trade-off that rarely works. Therefore, the data architecture was deliberately designed in two stages: synchronous within a region, asynchronous between regions—with mechanisms that guarantee consistency without blocking the platform.

For PostgreSQL, this means local high availability through synchronous replication within the region, plus asynchronous cross-region replication as a basis for failover. This is complemented by point-in-time recovery and geo-redundant backups, so recovery can not only “continue running” but also “roll back” when data correctness is concerned.
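The consequence of "synchronous in-region, asynchronous cross-region" is that a regional failover has a nonzero RPO: commits acknowledged by the local sync standby may not yet have reached the remote region. A minimal model of that trade-off (LSNs reduced to counters; class and field names are illustrative, not PostgreSQL's API):

```python
# Minimal model of "synchronous in-region, asynchronous cross-region".
class Primary:
    def __init__(self):
        self.lsn = 0
        self.sync_standby_lsn = 0    # in-region, confirms before commit ack
        self.remote_lsn = 0          # cross-region, applies with lag

    def commit(self) -> int:
        self.lsn += 1
        self.sync_standby_lsn = self.lsn   # ack only after sync replica confirms
        return self.lsn

    def replicate_remote(self):
        self.remote_lsn = self.lsn         # async: happens *after* the ack

p = Primary()
for _ in range(100):
    p.commit()
p.replicate_remote()      # remote caught up to LSN 100
for _ in range(3):
    p.commit()            # 3 commits acked locally, not yet shipped cross-region

# Promoting the remote standby loses the acked-but-unshipped delta:
rpo_in_commits = p.lsn - p.remote_lsn
print(rpo_in_commits)  # → 3
```

This is why the point-in-time recovery and geo-redundant backups mentioned above are part of the design rather than an afterthought: they bound what that delta can cost.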

Redis was designed to keep sessions available across regions. This may sound like a detail, but it is crucial in the event of a failover: if users have to re-authenticate or sessions are lost after a region switch, failover is technically successful but practically disruptive. In critical systems, user perception counts.

RabbitMQ was coupled across regions via federation to keep asynchronous processing robust. Especially in SCADA-related systems, messaging is not a convenience but the foundation for ensuring that data is not lost even during transient problems.
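The robustness property federation buys here is store-and-forward: when the inter-region link drops, messages queue up locally and drain once the link recovers, so a transient outage delays data instead of losing it. A sketch of that behavior (class and message names invented; this models the effect, not RabbitMQ's federation plugin itself):

```python
from collections import deque

class FederatedQueue:
    """Upstream buffers while the inter-region link is down,
    then forwards once it recovers."""
    def __init__(self):
        self.upstream: deque = deque()   # local region
        self.downstream: list = []       # remote region
        self.link_up = True

    def publish(self, msg):
        self.upstream.append(msg)
        self.pump()

    def pump(self):
        while self.link_up and self.upstream:
            self.downstream.append(self.upstream.popleft())

q = FederatedQueue()
q.publish("reading-1")
q.link_up = False                # transient inter-region outage
q.publish("reading-2")
q.publish("reading-3")
q.link_up = True
q.pump()                         # link recovers: buffered messages drain
print(q.downstream)              # → ['reading-1', 'reading-2', 'reading-3']
```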

Secrets and certificates were replicated across regions using Vault, so failover does not fail due to “missing credentials.” Here too, the rule applies: failover is only as fast as the slowest dependency. If secrets have to be synchronized manually, there is effectively no automated failover.
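"Failover is only as fast as the slowest dependency" can be stated as arithmetic: the effective RTO is bounded below by whichever component takes longest to become ready in the surviving region. A sketch with invented numbers (the component names and durations are illustrative, not measurements from this project):

```python
# Effective RTO = the slowest component's time-to-ready in the
# surviving region. All figures below are made up for illustration.
readiness_seconds = {
    "bgp-route-convergence": 15,
    "postgres-promotion": 20,
    "session-store": 2,
    "secrets-replication": 5,     # automated via replication
}

effective_rto = max(readiness_seconds.values())
bottleneck = max(readiness_seconds, key=readiness_seconds.get)
print(bottleneck, effective_rto)   # → postgres-promotion 20

# If secrets had to be synced by hand (say, 1800 s), they would dominate
# and erase every other optimization:
readiness_seconds["secrets-replication"] = 1800
print(max(readiness_seconds.values()))   # → 1800
```

The second print is the whole argument for replicating secrets automatically: one manual step in the chain sets the floor for the entire failover.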


GitOps Across Regions: ArgoCD as Multi-Cluster Control

Multi-region architecture only brings true stability if both regions are operated identically. Different versions, different configs, different policies are an invitation to “failover into surprise.”
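The invariant GitOps enforces here is that both regions render the same desired state, and that invariant is checkable: hash each cluster's manifests and compare. A minimal sketch (the manifests, image tags, and hashing scheme are illustrative; ArgoCD tracks sync status rather than raw hashes, but the idea is the same):

```python
import hashlib
import json

def state_hash(manifests: list[dict]) -> str:
    """Canonical fingerprint of a cluster's desired state."""
    canonical = json.dumps(
        sorted(manifests, key=lambda m: json.dumps(m, sort_keys=True)),
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

region_a = [{"kind": "Deployment", "name": "scada-api", "image": "scada-api:1.42"}]
region_b = [{"kind": "Deployment", "name": "scada-api", "image": "scada-api:1.42"}]

print(state_hash(region_a) == state_hash(region_b))   # True: regions identical

region_b[0]["image"] = "scada-api:1.41"               # someone patched by hand
print(state_hash(region_a) == state_hash(region_b))   # False: drift detected
```

Making drift a comparable value is what turns "both regions are operated identically" from a policy statement into something an audit can verify.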
