The Frankfurt Dilemma: Why Location Redundancy Isn't Enough for Critical Infrastructure

Operators of critical infrastructure (KRITIS) invest heavily in resilience. That planning, however, often ends at the data center’s property line. A typical setup in Frankfurt or Berlin looks like this: redundant power supply, two fire compartments, a highly available Kubernetes cluster spread across multiple racks, and replicated databases.

On paper, this setup achieves 99.99% availability. For KRITIS-relevant systems, that figure is often a dangerous illusion, because the model behind it assumes that the location as a whole never fails.
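The effect of that hidden assumption can be sketched with some simple arithmetic. The 99.9% site availability below is a purely illustrative assumption, not a measured value, and the two-region formula assumes that the regions fail independently of each other.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial(*availabilities: float) -> float:
    """Availability of parts that must all be up (e.g. cluster AND site)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Availability when any one of several independent replicas suffices."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Assumed figures for illustration only.
cluster = 0.9999  # in-site redundancy, as quoted in the text
site = 0.999      # hypothetical availability of the location itself

single_site = serial(cluster, site)
two_regions = parallel(single_site, single_site)

print(f"cluster alone: {downtime_minutes_per_year(cluster):.1f} min/year")
print(f"single site:   {downtime_minutes_per_year(single_site):.1f} min/year")
print(f"two regions:   {downtime_minutes_per_year(two_regions):.2f} min/year")
```

Under these assumed figures, the cluster alone corresponds to roughly 53 minutes of downtime per year, a single site to several hours, and two independent regions to under a minute.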

The Invisible Threat: The Single Point of Failure “Region”

A location failure, whether caused by a widespread power outage, a major network fault at the main provider, or a physical event, defeats any redundancy inside the site. If the data center or the region goes offline, even ten database replicas are useless when they all reside in the same area.

For operators of electricity, gas, or heating networks, as well as financial service providers, this is not a theoretical scenario but a regulatory risk. Audits under NIS-2 or the BSI Act increasingly demand proof that services continue to run even if an entire geographic node disappears.

The Solution: From Redundancy to Geo-Redundancy

To achieve true resilience, the architecture must extend beyond a single location. The path leads away from the “one fortress” model toward a distributed system. A robust approach rests on three pillars:

1. Decoupled Clusters Instead of Stretched Systems

Instead of painstakingly stretching a single Kubernetes cluster over two cities (“Stretched Cluster”), it has proven effective to operate a fully autonomous cluster per region. This limits the so-called Blast Radius: A software error or configuration problem in Region A cannot drag Region B down with it.

2. Active/Active Operation Instead of Cold Backups

A disaster recovery site that is activated only in an emergency often fails when it is needed, because it has never carried production load. A modern KRITIS architecture uses both locations simultaneously (Active/Active). Traffic is permanently distributed across both regions, so the infrastructure at each location is continuously tested under real load.
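A minimal sketch of this idea, assuming two hypothetical regions named `region-a` and `region-b` with equal weights: requests are spread across all healthy regions, and an unhealthy region simply drops out of the candidate set. The `Region` class and `pick_region` function are illustrative names, not part of any specific product.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    name: str
    weight: int          # relative share of traffic
    healthy: bool = True


def pick_region(request_key: str, regions: List[Region]) -> Region:
    """Map a request key onto a healthy region, proportionally to weight."""
    candidates = [r for r in regions if r.healthy and r.weight > 0]
    if not candidates:
        raise RuntimeError("no healthy region available")
    total = sum(r.weight for r in candidates)
    # Stable hash so the same key lands in the same region between calls.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % total
    for region in candidates:
        if slot < region.weight:
            return region
        slot -= region.weight
    raise AssertionError("unreachable: slot always falls inside total weight")


def share(regions: List[Region], samples: int = 10_000) -> Dict[str, int]:
    """Count how many synthetic requests each region receives."""
    counts = {r.name: 0 for r in regions}
    for i in range(samples):
        counts[pick_region(f"request-{i}", regions).name] += 1
    return counts


both_up = [Region("region-a", 50), Region("region-b", 50)]
a_down = [Region("region-a", 50, healthy=False), Region("region-b", 50)]

print(share(both_up))  # both regions carry live traffic
print(share(a_down))   # all traffic moves to region-b
```

One consequence of this model is worth noting: when `region-a` drops out, `region-b` receives the full load, so each region would need enough capacity to carry all traffic on its own.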

3. Intelligent Routing at the Network Level

So that users and connected systems (e.g., SCADA control technology) do not have to wait for manual intervention when a failure occurs, failover is shifted to the network. With techniques such as Anycast routing via BGP, both locations announce the same IP address. When a location is no longer reachable, its route is withdrawn and the network redirects traffic to the healthy location without any change to IP addresses or DNS entries. How quickly this happens depends on failure detection and BGP convergence: it typically takes seconds, and can be faster where a fast detection mechanism such as BFD is in place.
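The location-side half of this mechanism can be sketched as a small state machine that decides when a site should announce or withdraw its Anycast route based on local health checks. The prefix, the thresholds, and the function names are hypothetical. The command strings are modeled on the style of ExaBGP's text API and should be treated as an assumption to verify against whichever BGP speaker is actually in use.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration: a documentation prefix (RFC 5737)
# and thresholds that a real deployment would tune.
ANYCAST_PREFIX = "192.0.2.10/32"
FAIL_THRESHOLD = 3  # consecutive failed checks before withdrawing
RISE_THRESHOLD = 2  # consecutive good checks before announcing again


@dataclass
class AnycastState:
    announced: bool = False
    consecutive_ok: int = 0
    consecutive_fail: int = 0


def step(state: AnycastState, check_ok: bool) -> Optional[str]:
    """Feed one health-check result; return a routing command or None.

    The command syntax is an assumption in the style of ExaBGP's text API.
    """
    if check_ok:
        state.consecutive_ok += 1
        state.consecutive_fail = 0
        if not state.announced and state.consecutive_ok >= RISE_THRESHOLD:
            state.announced = True
            return f"announce route {ANYCAST_PREFIX} next-hop self"
    else:
        state.consecutive_fail += 1
        state.consecutive_ok = 0
        if state.announced and state.consecutive_fail >= FAIL_THRESHOLD:
            state.announced = False
            return f"withdraw route {ANYCAST_PREFIX} next-hop self"
    return None


# Simulated check results: service comes up, fails three times, recovers.
results = [True, True, True, False, False, False, True, True]
state = AnycastState()
commands = [cmd for ok in results if (cmd := step(state, ok)) is not None]
for cmd in commands:
    print(cmd)
```

The thresholds add hysteresis so that a single failed or successful check does not cause the route to flap. In this simulated run, the site announces the prefix, withdraws it after the third consecutive failure, and announces it again once two checks succeed.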

Conclusion: Resilience is a Matter of Location

True resilience for critical systems begins where dependency on a single data center ends. Those planning infrastructure today should treat geo-redundancy not as an “additional option” for later but as the architectural foundation. Only those who can demonstrate that a total regional failure does not endanger service quality meet the high demands of modern regulation.

FAQ

Why isn’t a backup in the second data center enough?

A backup protects the data, not availability. Restoring backups and manually switching DNS entries often takes hours. KRITIS requirements usually demand recovery times (RTO) in the range of minutes or seconds, which can only be achieved with systems that are already running.

Doesn’t multi-region operation massively increase complexity?

Complexity does increase, but it can be managed with modern orchestration tools and GitOps workflows. For critical systems, the gain in security and the ability to perform maintenance during operation almost always outweigh the administrative overhead.

Are there minimum distance requirements between locations?

Yes, the BSI often recommends a minimum distance (e.g., 100 km to 200 km) for geo-redundant setups so that large-scale disasters do not affect both locations at once. The exact requirements, however, depend on the specific regulations that apply (e.g., KritisV).

What is the difference between high availability and disaster recovery?

High availability protects against the failure of individual components (e.g., a server or a hard drive). Disaster recovery (and geo-redundancy) protects against catastrophic events that cripple entire infrastructures or locations.
