Two of three nodes failed in a pool overnight with messages like:
Node 18.104.22.168 has failed - Timeout while establishing connection (the machine may be down, or the network congested; increasing the 'max_connect_time' on the pool's 'Connection Management' page may help)
When I came in this morning, I was able to verify that the IP/port was available and functioning on both nodes, but the pool still had them marked as failed. I stopped the virtual server and started it back up, and the pool connected to all three nodes and has been happy for the last hour. It seems like it just didn't try to reconnect to those nodes.
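For anyone wanting to repeat that kind of check from a script, here is a minimal sketch in Python. It only tests whether a TCP connection can be opened within a timeout, which is roughly what a connect-style check does; it is not how the traffic manager's own monitor is implemented. The function name `can_connect` and the default timeout are assumptions made for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 4.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection completes the TCP handshake or raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `can_connect(node_ip, node_port)` for each node in the pool gives a quick yes/no on reachability from the machine where the script runs, which may differ from what the load balancer itself sees.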
I'm new to Stingray load balancers, having only worked with F5 in the past. Is this normal? Do I really need to manually intervene to fix this? Am I missing some obvious config?
One attached screenshot is the health monitor config (listed in the catalog as Connect; I'm not sure whether that is a standard monitor or was created by my predecessor).
The pool also has passive monitoring turned on.
The second screenshot shows the connection management settings:
Is something in those configs causing the LB to give up on my nodes after some outage and require me to manually intervene?
It looks like the passive monitor did indeed fail your nodes. In that case the passive monitor needs to recover them; a working active monitor will not recover a node that was failed by the passive checks. The most common reasons for a node not recovering from a passive failure are:
1. You don't have any traffic, or
2. Your traffic is all non-idempotent (e.g. POSTs)
If your traffic is largely POSTs then I would suggest disabling passive monitoring, because without any idempotent requests the passive monitor will never be able to recover the failed nodes.
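To get a rough idea of which case applies, the request methods could be tallied from whatever logs are available. The sketch below is only an illustration: it uses the HTTP specification's definition of idempotent methods (RFC 9110), which is an assumption about how the traffic manager classifies requests, and `idempotent_share` is a made-up helper name.

```python
from collections import Counter

# Idempotent methods per the HTTP specification (RFC 9110):
# the safe methods plus PUT and DELETE. POST and PATCH are not idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def idempotent_share(methods):
    """Return the fraction of requests using an idempotent method (0.0 if none given)."""
    counts = Counter(m.upper() for m in methods)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    idempotent = sum(n for m, n in counts.items() if m in IDEMPOTENT_METHODS)
    return idempotent / total
```

A result of 0.0 over a busy period would mean the traffic is all non-idempotent by that definition; an empty sample would mean there is no traffic at all.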
I've had this issue too and disabling/re-enabling a node tends to fix this.
Why does Passive Monitoring only recover nodes when it sees successful idempotent requests?
Surely a successful POST request with a timely 200 response should recover the node from a Passive Monitoring failure?
Neither this article, https://community.pulsesecure.net/t5/Pulse-Secure-vADC/Feature-Brief-Health-Monitoring-in-Stingray-T..., nor the Help guide within the VTM mentions this.
It seems I was mistaken. I had always believed that non-idempotent requests would never be tried against failed nodes, but it was pointed out to me today that this is simply not the case.
We will try POSTs against failed nodes.
I've looked back through the release notes and dug out an ancient copy of ZXTM (5.1r3), and it seems it's been that way for a long time, possibly forever.
So the only reason your traffic wouldn't recover the nodes is if you have no traffic, or if session persistence means the node is avoided for other reasons. Apologies for the confusion.