Failover Without DNS: How Anycast & BGP Reduce RTO to Under 30 Seconds
David Hussain · 3 min read

When critical infrastructure fails, every second counts. The key metric here is the RTO (Recovery Time Objective). In many disaster recovery concepts, the bottleneck is not server performance, but the Domain Name System (DNS).

Relying on DNS record switching during a site failure means fighting cache lifetimes (TTL) and the time it takes resolvers around the world to pick up the change. In the KRITIS environment (German critical infrastructure), where time-sensitive data flows and rigid firewall rules dominate, this approach is often too slow and unreliable. The solution lies a layer deeper: in the routing protocol of the internet itself.

The Problem: The Latency of DNS-Based Failovers

Traditional failover scenarios work by changing IP addresses in the DNS. This has three critical disadvantages:

  1. TTL Delays: Even with the TTL (Time-To-Live) set to 60 seconds, some resolvers and clients ignore this value and keep serving outdated IP addresses for minutes.
  2. Firewall Issues: In regulated networks (e.g., energy providers), firewalls are often configured with fixed IP allowlists. A new IP address in an emergency means connections stay blocked until the change is manually approved.
  3. Coordination Effort: With thousands of VPN tunnels or edge devices, an IP change leads to a massive synchronization problem across organizational boundaries.
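To make the TTL problem concrete, the worst-case delay of a DNS-based failover can be added up in a few lines. This is a minimal sketch, not a measurement: the detection time, the record update time, and the 300-second cache overstay are all assumed values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsFailoverEstimate:
    detection_s: float       # time until monitoring declares the site down (assumed)
    update_s: float          # time to publish the new record (assumed)
    ttl_s: float             # configured record TTL
    cache_overstay_s: float  # extra time caches keep the record past its TTL (assumed)

    def worst_case_s(self) -> float:
        # A client that cached the old record just before the update has to
        # wait out the full TTL plus any overstay before it sees the new IP.
        return self.detection_s + self.update_s + self.ttl_s + self.cache_overstay_s


estimate = DnsFailoverEstimate(detection_s=15, update_s=10, ttl_s=60, cache_overstay_s=300)
print(f"worst-case DNS failover: {estimate.worst_case_s():.0f} s")
```

Under these assumptions the TTL itself is only a small part of the total; the caches that ignore it dominate.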

The Solution: Anycast and the Border Gateway Protocol (BGP)

Instead of changing the IP address, we change the path to the IP. With Anycast, the same IP address (or the same IP prefix) is announced simultaneously from multiple geographically separated locations on the internet.

1. BGP as an Automatic Switchman

The Border Gateway Protocol (BGP) is the language in which routers exchange information about the reachability of IP networks. In a multi-region setup, both locations “announce” via BGP that they are responsible for a specific IP prefix. Internet routing then directs each user to the location with the best path from their network. This is the topologically closest location, which is usually, but not always, also the geographically closest one.
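How a router chooses between two announcements of the same prefix can be sketched as follows. This is a deliberately simplified model: real BGP best-path selection also weighs local preference, origin, MED and other attributes, while the sketch only compares AS-path length. The site names and AS numbers are hypothetical (the AS numbers come from the range reserved for documentation).

```python
from typing import Mapping, Optional, Sequence


def best_site(routes: Mapping[str, Sequence[int]]) -> Optional[str]:
    """Pick the site whose announcement has the shortest AS path.

    `routes` maps a site name to the AS path one router has learned for the
    anycast prefix. Ties are broken by site name to keep the result stable.
    """
    if not routes:
        return None  # no location announces the prefix any more
    return min(routes, key=lambda site: (len(routes[site]), site))


# Hypothetical view from one router on the internet.
routes = {
    "frankfurt": [64500, 64496],
    "berlin": [64501, 64502, 64497],
}
print(best_site(routes))
```

A router in a different network may hold different AS paths and therefore pick the other site, which is how traffic spreads across locations.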

2. Failover Through Route Withdrawal

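If a location completely fails, the BGP announcement for that location is withdrawn. The withdrawal propagates from router to router, usually within seconds, although the exact convergence time depends on the upstream networks. Once the global network has “learned” that this path no longer exists, all traffic automatically switches to the second, active location.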

  • The Advantage: The IP address remains the same. No DNS entry needs to be changed, no firewall rule needs to be adjusted. Traffic is simply rerouted at the network level.
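The withdrawal is typically triggered by a health check running next to the BGP speaker. The sketch below turns a stream of probe results into announce and withdraw commands, with a small amount of hysteresis so that a single failed probe does not move traffic. The command text follows the style used by ExaBGP's process API; treat that syntax, the prefix and the thresholds as assumptions to verify against your own BGP speaker.

```python
from typing import Iterable, Iterator

PREFIX = "192.0.2.0/24"  # documentation prefix, stands in for the real anycast prefix


def bgp_commands(checks: Iterable[bool], fall: int = 3, rise: int = 2) -> Iterator[str]:
    """Translate health-check results into announce/withdraw commands.

    The route is withdrawn after `fall` consecutive failed checks and
    announced again after `rise` consecutive successful checks.
    """
    announced = False
    streak = 0  # consecutive results that contradict the current state
    for healthy in checks:
        if healthy == announced:
            streak = 0  # result agrees with the current state, nothing to do
            continue
        streak += 1
        needed = rise if healthy else fall
        if streak >= needed:
            announced = healthy
            streak = 0
            verb = "announce" if announced else "withdraw"
            # Command syntax is an assumption modelled on ExaBGP's text API.
            yield f"{verb} route {PREFIX} next-hop self"


# Two good probes bring the site online, one blip is ignored,
# three failures in a row take it offline again.
for command in bgp_commands([True, True, False, True, False, False, False]):
    print(command)
```

Keeping the decision logic in a pure generator makes it easy to test without a running router; in production the probe results would come from a real service check.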

3. Bring Your Own IP (BYOIP)

For KRITIS operators, it is often sensible to use their own, provider-independent IP address ranges. This BYOIP concept allows full control over routing and ensures platform accessibility independent of a single cloud provider’s or data center’s infrastructure.


Conclusion: Routing Beats Runbook

True business continuity in critical environments must not depend on manual processes or unreliable DNS propagation. By using Anycast and BGP, failover becomes an automated network function rather than an organizational task. The result is an RTO that often falls below 30 seconds, a value hardly achievable with DNS-based methods.
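The 30-second figure can be read as a simple budget: the time to detect the failure plus the time for the withdrawal to converge. The numbers below are illustrative assumptions, not measurements from a specific deployment.

```python
def anycast_rto_s(probe_interval_s: float, fall: int, convergence_s: float) -> float:
    """Detection time (`fall` consecutive failed probes) plus BGP convergence."""
    return probe_interval_s * fall + convergence_s


# Assumed: a probe every 5 s, 3 failures before withdrawal, 10 s convergence.
rto = anycast_rto_s(probe_interval_s=5, fall=3, convergence_s=10)
print(f"estimated RTO: {rto:.0f} s")
```

Shorter probe intervals tighten the budget, at the cost of a higher risk of withdrawing a healthy site because of a brief glitch.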


FAQ

What happens to existing TCP connections during a failover with Anycast? Since the routing path changes, existing TCP connections are usually interrupted and need to be re-established by the client. However, since the IP remains the same, this reconnection typically occurs so quickly that users or automated systems hardly notice.
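On the client side, this simply means reconnecting to the same address. A minimal sketch of such a retry loop, assuming a plain TCP client; the attempt count, delay and timeout are illustrative, and the `connect` and `sleep` parameters exist only so the logic can be tested without a network.

```python
import socket
import time
from typing import Callable, Optional


def connect_with_retry(
    host: str,
    port: int,
    attempts: int = 5,
    delay_s: float = 1.0,
    connect: Callable[..., socket.socket] = socket.create_connection,
    sleep: Callable[[float], None] = time.sleep,
) -> socket.socket:
    """Reconnect to the same anycast IP until a site answers or attempts run out."""
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        try:
            # The address never changes; only the route behind it does.
            return connect((host, port), timeout=3.0)
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_s)
    raise ConnectionError(f"no site reachable at {host}:{port}") from last_error
```

A call such as `connect_with_retry("192.0.2.10", 443)` (a documentation address, used here as a placeholder) would keep trying the unchanged IP while routing converges on the surviving location.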

Do I need my own IP address ranges and AS number for Anycast? Ideally, yes. To have full control over BGP routing, an Autonomous System Number (ASN) and your own IP prefix are advisable. However, there are also cloud providers and partners who offer Anycast as a service on their infrastructure.

Is Anycast suitable for internal communication between locations? Anycast is primarily used for inbound traffic (from outside to the platform). For internal communication between clusters (e.g., database replication), classic unicast connections over dedicated site links are used to specifically target a particular endpoint.

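How does Anycast affect latency? Usually positively. Routing leads each user to the “nearest” location in terms of network topology, so latency for geographically distributed user groups tends to decrease without the need for complex load balancing logic at the application level. Because BGP optimizes for routing policy rather than measured latency, the chosen location is not guaranteed to be the fastest one in every case.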
