Single Point of Failure Location: Why One Data Center Isn't Enough for Critical Infrastructure
David Hussain · 4-minute read

In the world of critical infrastructures (KRITIS), “high availability” is not just a buzzword but a legal and societal obligation. Those who operate control systems for electricity, gas, or heating networks work in an environment where failures can have immediate impacts on public supply security.

Many companies feel secure because their platform is redundantly built within a data center (DC): multiple server racks, redundant power supplies, mirrored databases, and a local Kubernetes cluster with multiple control plane nodes. However, this architecture has an Achilles’ heel: it protects against the failure of a component but not against the failure of the location.

1. The Illusion Problem: Locally Redundant is Not Geo-Redundant

As long as a data center is operational, internal redundancy works excellently. But risk analysis for KRITIS-relevant systems must go further. What happens in the event of:

  • Widespread power outages that exceed emergency power capacities?
  • Physical incidents such as fires, floods, or massive fiber optic network damage at the location?
  • Geopolitical or regional disasters that make access to the location impossible?

A system that exists in only one place is, no matter how redundantly it is built internally, a Single Point of Failure (SPOF) at the location level. For regulators and auditors (BSI, Bundesnetzagentur) and under frameworks such as NIS-2, this concentration risk is increasingly unacceptable.

2. The Regulatory Pressure: NIS-2 and KRITIS Requirements

Regulatory requirements have tightened. It is no longer sufficient to have a backup that can be restored “sometime” in an emergency.

  • Business Continuity Management (BCM): A demonstrable plan is required for how operations can continue without massive interruption in the event of a site failure.
  • RTO (Recovery Time Objective): The time frame for service restoration must be defined and - above all - technically verifiable. In critical sectors, we are often talking about minutes, not hours.
  • Proof Obligation: Auditors today demand technical evidence and test protocols for emergencies. A mere runbook on paper is classified as a “deficiency.”
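The "technically verifiable" part of the RTO requirement can be made concrete with a small measurement harness. The sketch below is a minimal, hypothetical example of producing evidence during a failover drill: it polls a health probe and records how long the service was unavailable. The function name, parameters, and thresholds are illustrative assumptions, not a prescribed tool.

```python
import time

def measure_rto(probe, rto_target_s, poll_interval_s=1.0, timeout_s=600.0,
                clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` (returns True when the service is healthy) during a
    failover drill and report the measured downtime against the RTO target.
    `clock` and `sleep` are injectable so drills can be replayed in tests."""
    start = clock()
    while not probe():
        if clock() - start > timeout_s:
            raise TimeoutError("service did not recover within the drill window")
        sleep(poll_interval_s)
    downtime = clock() - start
    # The returned dict is the kind of artifact an auditor can file as evidence.
    return {"downtime_s": round(downtime, 1), "rto_met": downtime <= rto_target_s}
```

Run during a scheduled drill (with `probe` hitting the real service endpoint), the output is a timestamped, reproducible record rather than a runbook on paper.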

3. The Technological Hurdle: DNS Latency and Manual Processes

In many historically grown IT landscapes, a second data center (“Location B”) exists, but the failover process is a manual Herculean task:

  1. Manually start services at Location B.
  2. Load database backups or manually promote replicas.
  3. Switch DNS entries and wait for the global TTL (Time to Live) to expire.
  4. Inform customers that IPs may have changed.
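How slow is "far too slow"? A back-of-the-envelope calculation makes the manual path tangible. All step durations below are illustrative assumptions for a typical runbook, not measurements from a real incident:

```python
# Illustrative worst-case estimate of a manual failover.
# Every duration here is an assumption for the sake of the example.
manual_steps_s = {
    "detect_and_decide": 15 * 60,      # on-call confirms the outage and escalates
    "start_services_site_b": 20 * 60,  # manually start services at Location B
    "promote_db_replica": 10 * 60,     # promote replicas / load backups
    "switch_dns": 5 * 60,              # update the DNS records
    "dns_ttl_propagation": 60 * 60,    # wait for a 1-hour TTL to expire worldwide
}

total_s = sum(manual_steps_s.values())
print(f"worst-case manual failover: {total_s / 60:.0f} min")  # → 110 min
```

Even with generous assumptions, the DNS TTL alone dominates the total: nearly two hours of outage for a platform whose RTO is measured in minutes.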

For a KRITIS platform that processes real-time data, this process is far too slow and error-prone. Relying on manual coordination means having a risk, not a disaster recovery plan.

Conclusion: Geo-Redundancy as an Architectural Foundation

True fail-safety begins where the location is treated as an interchangeable resource. For KRITIS operators, this means shifting from single-location logic to an Active/Active Multi-Region Model, in which workloads run simultaneously in at least two geographically separated locations. If one location fails, the other takes over seamlessly - ideally without the end user or connected systems (such as SCADA gateways) noticing the switch.
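The "seamless takeover" idea can be sketched from the client's perspective: a caller that knows about both regions and transparently retries against the surviving one. This is a minimal illustration, assuming hypothetical endpoint names and a generic `send` callable; production setups would do this at the network layer rather than in application code:

```python
def route_request(regions, send):
    """Try each region in order until one answers. In an Active/Active setup
    both endpoints serve live traffic, so a caller never depends on a single
    location. `regions` is an ordered list of endpoints; `send(endpoint)`
    returns a response or raises ConnectionError on failure."""
    errors = []
    for endpoint in regions:
        try:
            return send(endpoint)
        except ConnectionError as exc:
            errors.append((endpoint, str(exc)))
    raise RuntimeError(f"all regions failed: {errors}")
```

If `send("site-a.example")` raises because Location A is down, the same request lands on `site-b.example` without the caller changing anything - the behavior the article's next part aims to achieve at the network level, without DNS latency.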

In the next part of this series, we will look at how to solve this problem at the network level so that a failover can succeed without the latency of DNS switches.


FAQ

Are two availability zones within a cloud provider enough? Often, availability zones are located in the same city or region (e.g., Frankfurt). In a large-scale event (flood, power grid collapse), all zones could be affected simultaneously. For KRITIS, a true geographical distance (e.g., > 100 km) is often required.
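Whether two sites meet such a distance requirement is easy to check with the haversine great-circle formula. The coordinates below (Frankfurt and Hamburg, roughly) and the 100 km threshold from the answer above are illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two sites via the haversine formula."""
    r = 6371.0  # mean Earth radius in km
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Approximate coordinates: Frankfurt (50.11 N, 8.68 E), Hamburg (53.55 N, 9.99 E)
d = distance_km(50.11, 8.68, 53.55, 9.99)
print(f"{d:.0f} km apart -> meets the 100 km criterion: {d > 100}")
```

Two availability zones within the Frankfurt metro area, by contrast, would fail this check by a wide margin.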

Is geo-redundancy too expensive for medium-sized platforms? While the infrastructure costs nearly double, the use of Kubernetes and automation reduces the operational effort for disaster recovery tests and maintenance. The greatest cost risk is the penalty for a prolonged outage or the revocation of the operating license by auditors.

What is the difference between Disaster Recovery and Business Continuity? Disaster Recovery (DR) focuses on recovery after a failure (often with data loss/downtime). Business Continuity (BC) aims to maintain operations despite the failure without noticeable interruption. KRITIS increasingly demands BC.

Can we simply “double” host our existing application? Technically yes, but the challenge lies in data synchronization and traffic routing. An application must be designed or adapted to be “Cloud-Native” to function consistently in a multi-region setup.

How does ayedo support risk analysis? We conduct a technical audit of your current infrastructure, identify single points of failure, and develop a roadmap for a geo-redundant target architecture that is both technically stable and audit-compliant.
