Traffic Manager does not provide a ‘connection mirroring’ or ‘transparent failover’ capability. This article describes contemporary connection mirroring techniques and their strengths and limitations, and explains how Traffic Manager may be used with VMware Fault Tolerance to create an effective solution that preserves all connections in the event of a hardware failure, while processing them fully at layer 7.
A fault tolerant load balancer cluster eliminates single points of failure: When load balancers are deployed in a fault tolerant cluster, they present a reliable endpoint for the services they manage. If one load balancer device fails, its peers are able to step in and accept traffic so the service can continue to operate.
…but a failover event will drop all established TCP connections: However, if a load balancer fails, any TCP connections that are established to that load balancer will be dropped. Clients will either receive a RST or FIN close message, or they may just experience a timeout. The clients will need to re-establish the TCP connection. This is an inconvenience for long-lived protocols that do not support automatic reconnects, such as FTP.
Connection Mirroring offers a solution: If the load balancers are operating in a basic layer-4 packet forwarding mode, the only actions they perform are to NAT the packets to the correct target node and to apply sequence number translation. They can share this connection information with their peer. If a load balancer fails, the TCP client will retransmit its packets after an appropriate timeout. The packets will be received by the peer, which can then apply the correct NAT and sequence number operations.
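The layer-4 state that has to be mirrored is small, which is why this approach is feasible. The following is a minimal sketch in Python, not any vendor's implementation: the class names `FlowKey`, `FlowState` and `Layer4Balancer`, the addresses, and the direct write into the peer's table (standing in for a heartbeat link) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    """Hypothetical identifier for one client connection to a virtual IP."""
    client_ip: str
    client_port: int
    vip: str
    vip_port: int


@dataclass
class FlowState:
    """The per-connection state a layer-4 forwarder needs."""
    node_ip: str     # back-end node chosen for this connection (NAT target)
    node_port: int
    seq_delta: int   # offset applied to TCP sequence numbers, modulo 2**32


class Layer4Balancer:
    def __init__(self):
        self.flows = {}
        self.peer = None

    def open_flow(self, key, state):
        self.flows[key] = state
        if self.peer is not None:
            # Mirror the entry; a real device would send it over a network link.
            self.peer.flows[key] = state

    def forward(self, key, seq):
        state = self.flows.get(key)
        if state is None:
            return None  # unknown connection: nothing to translate
        return (state.node_ip, state.node_port, (seq + state.seq_delta) % 2**32)


active, standby = Layer4Balancer(), Layer4Balancer()
active.peer = standby

key = FlowKey("198.51.100.7", 40312, "203.0.113.10", 80)
active.open_flow(key, FlowState("10.0.0.21", 8080, 1000))

# A retransmitted packet is translated identically by either device.
before_failure = active.forward(key, 4294967000)
after_failure = standby.forward(key, 4294967000)
```

Because the standby holds the same NAT target and sequence offset, it translates the retransmitted packet exactly as the failed device would have. A connection that was never mirrored has no entry on the standby and cannot be resumed.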
Connection mirroring is best used when only very basic packet-based load balancing is in use. For example, F5 recommend that you "enable connection mirroring on Performance (Layer 4) virtual servers only" and comment "mirroring short-term connections such as HTTP and UDP is not recommended, as this will cause a decr....
Cisco also support layer 4 connection mirroring (referring to it as ‘Stateful Failover’) and note that it is only possible for layer 4 connections. When using a Cisco ACE device, it is not possible to fail over connections that are proxied, including connections that employ SSL decryption or HTTP compression.
Layer 7 connection mirroring puts a very high load on the dedicated heartbeat link (all incoming packets are replicated to the standby peer) and is CPU intensive (both traffic managers must process the same transactions at layer 7). It may add latency or interfere with normal system operation, and not all ADC features are supported in a mirrored configuration. Because of these limitations, F5 advise "the overhead incurred by mirroring HTTP connections is excessive given the minimal advantages."
Due to timing and implementation details, connection mirroring does not guarantee seamless failover. State data must be shared with the peer once the TCP connection is established, and this must be done asynchronously to avoid delaying every TCP connection. If a load balancer fails before it has shared the state information, the TCP session cannot be resumed.
Typical duration of a TCP transaction (not including lingering keepalives) | 500 ms |
Typical window before which state information is synchronized (implementation dependent) | 200 ms (state exchanged 5 times per second) |
On failure, percentage of connections that cannot be re-established | 200/500 = 40% |
Connection mirroring does not guarantee seamless failover because connections must proceed while state is being shared
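The estimate in the table can be reproduced with a short Python sketch. It uses the same simplified model as the table, namely that a connection is unrecoverable if the failure occurs within the synchronization window of a transaction. The function name `unrecoverable_fraction` is hypothetical, and capping the result at 100% is an added assumption for windows longer than the transaction.

```python
def unrecoverable_fraction(transaction_ms, sync_window_ms):
    """Share of in-flight connections whose state has not yet reached the peer.

    Simplified model from the table: sync window divided by transaction
    duration, capped at 1.0 when the window exceeds the transaction.
    """
    if transaction_ms <= 0:
        raise ValueError("transaction_ms must be positive")
    return min(1.0, sync_window_ms / transaction_ms)


# Figures from the table: 500 ms transactions, 200 ms sync window.
lost = unrecoverable_fraction(500, 200)
print(f"{lost:.0%} of connections cannot be re-established")
```

Shortening the window reduces the loss but increases the volume of state traffic exchanged between the peers.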
Connection mirroring carries a cost: increased internal traffic for state sharing, and severe limitations on the functionality that may be used at the load balancing tier. What effect does it have on a service’s uptime?
Typical duration of a TCP transaction (not including lingering keepalives) | 500 ms |
Typical number of individual load balancer failures in a 12 month period | 5 |
Percentage of transactions that would be dropped if a load balancer failed | 50% (assuming an active-active pair of load balancers) |
Percentage of transactions that would be recovered on a failure | 60% (analysis above: 40% would not be recovered) |
What is the probability that an individual connection would be impacted by a load balancer failure? | 500/(365*24*3600*1000) * 50% * 5 = 0.000000040 |
What is the probability that the connection could be ‘rescued’ with connection mirroring? | 60% = 0.6 |
What proportion of transactions would be impacted by a failure, and then recovered by connection mirroring? | 0.000000040 * 0.6 = 0.000000024 (i.e. 0.0000024%) |
Connection mirroring improves uptime by an infinitesimal amount
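The arithmetic in the table can be checked with a few lines of Python. This is a sketch under the table's own assumptions (500 ms transactions, 5 failures per year, half of the traffic on the failed device, 60% of affected connections recovered) plus one added assumption, a 365-day year. The names `MS_PER_YEAR` and `impacted_probability` are hypothetical.

```python
MS_PER_YEAR = 365 * 24 * 3600 * 1000  # assumes a 365-day year


def impacted_probability(transaction_ms, failures_per_year, share_on_failed_device):
    """Chance that a given transaction is in flight on a device when it fails."""
    return transaction_ms / MS_PER_YEAR * share_on_failed_device * failures_per_year


p_impacted = impacted_probability(500, 5, 0.5)
p_rescued = p_impacted * 0.6  # 60% of affected connections are recovered

print(f"impacted by a failure: {p_impacted:.9f}")
print(f"rescued by mirroring:  {p_rescued:.9f}")
```

Both values are below 0.0000001, the failure budget implied by 99.99999% availability, which is why the gain from mirroring is so small in this model.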
Consider using connection mirroring when:
Don’t use connection mirroring when:
Benefits of using Connection Mirroring
Costs of using Connection Mirroring
Balance the benefits of connection mirroring against the additional risk and complexity of enabling it and the potential loss in performance and functionality that will result. Be aware that, based on the preceding analysis, unless your goal is to achieve more than seven nines of uptime (99.99999%), connection mirroring will not measurably contribute to the reliability of your service.
Pulse customers include emergency and first-response services around the world, NGO services publishing disaster-response information and even major political fund-raising concerns. In each case, extremely high availability and consistent performance in the face of large spikes of traffic are paramount to the organizations who selected Traffic Manager.
A number of customers use VMware Fault Tolerance with Traffic Manager to achieve enhanced uptime without compromising any of the functionality that Traffic Manager offers. VMware Fault Tolerance maintains a perfect shadow of a running virtual machine, running on a separate host. If the primary virtual machine fails due to a catastrophic hardware failure, the shadow seamlessly takes over all traffic, including established connections, with a typical latency of less than 1 ms. All application-level workloads, such as SSL decryption, TrafficScript processing and authentication, are maintained without any interruption in service:
VMware Fault Tolerance runs a secondary virtual machine in ‘lock step’ with the primary. Network traffic and other non-deterministic events are replicated to the secondary, ensuring that it maintains an identical execution state to the primary.
If the primary fails, the secondary takes over seamlessly and a new secondary is started.
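The lock-step principle can be illustrated with a toy Python model. This is not how VMware Fault Tolerance is implemented; it only shows why replaying the same ordered inputs on two deterministic replicas keeps their state identical. The `Replica` class and the event strings are hypothetical.

```python
import hashlib


class Replica:
    """Toy deterministic state machine standing in for a virtual machine."""

    def __init__(self):
        self.digest = hashlib.sha256(b"boot").hexdigest()
        self.applied = 0

    def apply(self, event: bytes):
        # The new state depends only on the previous state and the input.
        self.digest = hashlib.sha256(self.digest.encode() + event).hexdigest()
        self.applied += 1


primary, secondary = Replica(), Replica()

# External inputs (packets, timer ticks) are the non-deterministic events.
events = [b"packet:SYN", b"timer:17", b"packet:ACK"]
for event in events:
    primary.apply(event)
    secondary.apply(event)  # the same input is replayed on the shadow

in_lockstep = primary.digest == secondary.digest
print("replicas identical:", in_lockstep)
```

Because the secondary already holds the full execution state, including every open connection, it does not need to reconstruct anything when it takes over.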
Such configurations leverage standard VMware technology and are fully supported. They have been proven in production and offer enhanced connection mirroring functionality compared to proprietary ADC solutions.