This technical brief discusses Stingray's Clustering and Fault Tolerance mechanisms ('TrafficCluster').
Stingray Traffic Managers are routinely deployed in clusters of two or more devices, for fault-tolerance and scalability reasons.
A cluster is a set of traffic managers that share the same basic configuration (for variations, see 'Multi-Site Management' below). These traffic managers act as a single unit with respect to the web-based administration and monitoring interface: configuration updates are automatically propagated across the cluster, and diagnostics, logs and statistics are automatically gathered and merged by the admin interface.
There is no explicit or implicit 'master' in a Stingray cluster - like the Knights of the Round Table, all Stingrays have equal status. This design improves reliability of the cluster as there is no need to nominate and track the status of a single master device.
Administrators can use the admin interface on any Stingray device to manage the cluster. Intra-cluster communication is secured using SSL and fingerprinting to protect against the interception of configuration updates and the impersonation (man-in-the-middle) of a cluster member.
Note: You can remove administration privileges from selected traffic managers by disabling the control!canupdate configuration setting for those traffic managers. Once a traffic manager is restricted in that way, its peers will refuse to accept configuration updates from it, and the administration interface is disabled on that traffic manager. If a restricted traffic manager is in some way compromised, it cannot be used to further compromise the other traffic managers in your cluster.
If you find yourself in the position that you cannot access any of the unrestricted traffic managers and you need to promote a restricted traffic manager to regain control, please refer to the technical note What to do if you need to access a restricted Stingray Traffic Manager.
Incoming network traffic is distributed across a cluster using a concept named 'Traffic IP Groups'.
A Traffic IP Group contains a set of floating (virtual) IP addresses (known as 'Traffic IPs') and it spans some or all of the traffic managers in a cluster:
The Stingray Cluster contains traffic managers 1, 2, 3 and 4.
Traffic IP Group A contains traffic IP addresses 'x' and 'y' and is managed by traffic managers 1, 2 and 3.
Traffic IP Group B contains traffic IP address 'z' and is managed by traffic managers 3 and 4.
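The example layout above can be written down as a small data model. This is only an illustrative sketch: the class, the `tm1`..`tm4` names and the helper functions are hypothetical and are not part of Stingray's configuration format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficIPGroup:
    name: str
    traffic_ips: tuple  # floating (virtual) IP addresses in the group
    managers: tuple     # traffic managers that the group spans


# The four-node cluster and the two groups from the example above.
CLUSTER = ("tm1", "tm2", "tm3", "tm4")

GROUPS = (
    TrafficIPGroup("A", ("x", "y"), ("tm1", "tm2", "tm3")),
    TrafficIPGroup("B", ("z",), ("tm3", "tm4")),
)


def groups_for(manager, groups=GROUPS):
    """Return the names of the traffic IP groups a manager takes part in."""
    return [g.name for g in groups if manager in g.managers]


def validate(groups=GROUPS, cluster=CLUSTER):
    """A group may only span traffic managers that belong to the cluster."""
    return all(set(g.managers) <= set(cluster) for g in groups)
```

Note that a traffic manager may belong to more than one group: in this layout, `groups_for("tm3")` lists both A and B.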
The traffic managers handle traffic that is destined to the traffic IP addresses using one of two methods, single-hosted or multi-hosted:
Single-hosted is typically easier to manage and debug in the event of problems because all of the traffic to a traffic IP address is targeted at the same traffic manager. In high-traffic environments, it's common to assign multiple IP addresses to a single-hosted traffic IP group and let the traffic managers distribute those IP addresses evenly among themselves. If you publish all of the IP addresses using round-robin DNS, traffic is spread approximately evenly across these IP addresses.
Multi-hosted traffic IP groups are more challenging to manage, but they have the advantage that all traffic is evenly distributed across the machines that manage the traffic IP group.
For more information, refer to the article Feature Brief: Deep-dive on Multi-Hosted IP addresses in Stingray Traffic Manager.
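To illustrate why several single-hosted traffic IPs combined with round-robin DNS give a roughly even load, here is a minimal simulation. It assumes an idealised DNS server that rotates strictly through the published addresses and a simple round-robin assignment of IPs to traffic managers; neither is claimed to be how Stingray or any particular DNS server behaves, and the addresses come from the reserved documentation range.

```python
from collections import Counter
from itertools import cycle


def assign_single_hosted(traffic_ips, managers):
    """Give each traffic IP exactly one owner, spreading the IPs evenly."""
    owners = cycle(managers)
    return {ip: next(owners) for ip in traffic_ips}


def round_robin_dns(traffic_ips, lookups):
    """Simulate a DNS server that rotates through the published addresses."""
    answers = cycle(traffic_ips)
    return [next(answers) for _ in range(lookups)]


# Four traffic IPs shared between two traffic managers (hypothetical names).
ips = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
owner = assign_single_hosted(ips, ["tm1", "tm2"])

# Count how many of 1000 client lookups land on each traffic manager.
load = Counter(owner[ip] for ip in round_robin_dns(ips, 1000))
```

In practice, DNS caching and uneven client behaviour make the split only approximately even, which is why the number of published IPs is usually a multiple of the number of traffic managers.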
If possible, you should use single-hosted traffic IP groups in very high traffic environments. Although multi-hosted gives even traffic distribution, this comes at a cost:
The traffic managers in a cluster each perform frequent self-tests, verifying network connectivity and correct operation. They broadcast health messages periodically (every 500 ms by default - see flipper!monitor_interval) and listen for the health messages from their peers.
If a traffic manager fails, it either broadcasts a health message indicating the problem, or (in the event of a catastrophic situation) it stops broadcasting health messages completely. Either way, its peers in the Stingray cluster will rapidly identify that it has failed.
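The two failure signals described above (an explicit unhealthy message, or silence) can be sketched as a simple peer monitor. This is an illustration, not Stingray's implementation: the 0.5 second interval is the stated default, but `MISSED_LIMIT`, the class and its method names are assumptions made for the example.

```python
MONITOR_INTERVAL = 0.5  # seconds between health messages (the stated default)
MISSED_LIMIT = 10       # hypothetical: silent intervals tolerated before declaring failure


class PeerMonitor:
    """Tracks the health messages one traffic manager hears from its peers."""

    def __init__(self, peers, now=0.0):
        self.last_seen = {peer: now for peer in peers}
        self.reported_failed = set()

    def on_health_message(self, peer, healthy, now):
        """Record a health message; a peer may announce its own failure."""
        self.last_seen[peer] = now
        if healthy:
            self.reported_failed.discard(peer)
        else:
            self.reported_failed.add(peer)

    def failed_peers(self, now):
        """Peers that reported a problem or have been silent for too long."""
        deadline = MONITOR_INTERVAL * MISSED_LIMIT
        silent = {p for p, seen in self.last_seen.items() if now - seen > deadline}
        return sorted(silent | self.reported_failed)
```

A peer that announces a problem is treated as failed immediately, while a peer that simply goes quiet is treated as failed only once the silence exceeds the assumed timeout.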
In this situation, two actions are taken:
Note that if a traffic manager fails, it will voluntarily drop any traffic IP addresses that it is responsible for.
If a traffic manager fails, the traffic IP addresses that it is responsible for are redistributed. The goal of the redistribution method is to share the orphaned IP responsibilities as evenly as possible with the remaining traffic managers in the group, without reassigning any other IP allocations. This minimizes disruption and seeks to ensure that traffic is as evenly shared as possible across the remaining cluster members.
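The stated goal (share the orphaned IPs evenly without moving any other allocation) can be illustrated with a greedy sketch. This is an assumption-laden example rather than Stingray's actual algorithm: the function name, the tie-breaking by manager name and the `ip1`..`ip6` labels are all hypothetical.

```python
def redistribute(assignment, failed, managers):
    """Return a new {traffic_ip: manager} map after `failed` leaves the group.

    Only the failed manager's IPs move; each one goes to whichever surviving
    manager currently holds the fewest IPs (ties broken by manager name).
    """
    survivors = [m for m in managers if m != failed]
    if not survivors:
        raise ValueError("no traffic managers left in the group")

    result = dict(assignment)  # leave the caller's mapping untouched
    load = {m: 0 for m in survivors}
    for manager in assignment.values():
        if manager in load:
            load[manager] += 1

    orphaned = sorted(ip for ip, m in assignment.items() if m == failed)
    for ip in orphaned:
        target = min(survivors, key=lambda m: (load[m], m))
        result[ip] = target
        load[target] += 1
    return result


before = {"ip1": "tm1", "ip2": "tm1", "ip3": "tm2",
          "ip4": "tm2", "ip5": "tm3", "ip6": "tm3"}
after = redistribute(before, "tm1", ["tm1", "tm2", "tm3"])
```

Here `ip1` and `ip2` are split between `tm2` and `tm3`, while `ip3` to `ip6` stay exactly where they were, so existing traffic to the surviving managers is not disturbed.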
The single-hosted method is granular to the level of individual traffic IP addresses. The failover method is described in the article How are single-hosted traffic IP addresses distributed in a Stingray cluster (TODO).
The multi-hosted method is granular to the level of an individual TCP connection. Its failover method is described in the article How are multi-hosted traffic IP addresses distributed in a Stingray cluster (TODO).
Stingray machines within a cluster will share some state information:
Stingray does not share detailed connection information across a cluster (SSL state, rules state, etc.), so if a Stingray Traffic Manager fails, any TCP connections it is managing at that time are dropped. You can ensure that no connections are dropped by using a technique like VMware Fault Tolerance to run a shadow traffic manager that tracks the complete state of the active traffic manager. This solution is supported by Riverbed and is in use in a number of deployments where five-nines or six-nines uptime is not sufficient:
VMware Fault Tolerance is used to ensure that no connections are dropped in the event of a Stingray failure
Recall that all of the traffic managers in a Stingray cluster have identical copies of configuration and therefore will operate in identical fashions.
Stingray Traffic Manager clusters may span multiple locations, and in some situations, you may need to run slightly different configurations in each location. For example, you may wish to use a different pool of web servers when your service is running in your New Jersey datacenter compared to your Docklands datacenter.
In simple situations, this can be achieved with judicious use of TrafficScript to apply slightly different traffic management actions based on the identity of the traffic manager that is processing the request (sys.hostname()), or the IP address that the request was received on:
$ip = request.getLocalIP();
# Traffic IPs in the range 31.44.1.* are hosted in Docklands
if( string.ipmaskMatch( $ip, "31.44.1.0/24" ) )
   pool.select( "Docklands Webservers" );
# Traffic IPs in the range 154.76.87.* are hosted in New Jersey
if( string.ipmaskMatch( $ip, "154.76.87.0/24" ) )
   pool.select( "New Jersey Webservers" );
In more complex situations, you can enable the Multi-Site Management option for the Stingray configuration. This option allows you to apply a layer of templating to your configuration - you define a set of locations, assign each traffic manager to one of these locations, and then you can template individual configuration keys so that they take different values depending on the location in which the configuration is read.
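The templating idea can be pictured as a per-location lookup with a fallback. The sketch below is purely conceptual: the key names, the `"*"` default marker, the location names and the `read_key` function are hypothetical and do not reflect how Stingray stores or resolves templated configuration.

```python
# Each traffic manager is assigned to exactly one location.
LOCATION_OF = {"tm1": "docklands", "tm2": "docklands", "tm3": "new-jersey"}

# A templated key holds one value per location; "*" is the fallback value.
CONFIG = {
    "default_pool": {
        "*": "Default Webservers",
        "docklands": "Docklands Webservers",
        "new-jersey": "New Jersey Webservers",
    },
    "max_connections": {"*": 1000},  # not templated: same value everywhere
}


def read_key(key, manager, config=CONFIG, location_of=LOCATION_OF):
    """Return the value of `key` as seen by the given traffic manager."""
    values = config[key]
    location = location_of.get(manager)  # None if the manager has no location
    return values.get(location, values["*"])
```

The same configuration is replicated to every traffic manager; only the value that each one reads differs according to its location.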
There are limitations to the scope of Multi-Site Manager; for example, the REST API is not able to manage configuration that is templated using Multi-Site Manager. Please refer to the What is Stingray Multi-Site Manager? feature brief for more information, and to the relevant chapter in the Stingray Product Documentation for details of limitations and caveats.