
pool won't reconnect to failed node after it recovers

SOLVED
jasbro
New Contributor

pool won't reconnect to failed node after it recovers

Two of three nodes failed in a pool overnight with messages like:

Node 123.123.123.123 has failed - Timeout while establishing connection (the machine may be down, or the network congested; increasing the 'max_connect_time' on the pool's 'Connection Management' page may help)

 

When I came in this morning, I was able to verify that the IP and port were available and functioning on both nodes, but the pool still had them marked as failed. I stopped the virtual server and started it back up, and the pool connected to all three nodes and has been happy for the last hour. It seems like it just didn't try to reconnect to those nodes.
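For anyone repeating that manual check, a minimal sketch of a plain TCP connect test in Python is below. It is only an approximation of what a connect-style monitor does, not the load balancer's own code; the function name `port_is_open` and the 4-second default timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 4.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host, connects, and raises OSError
        # (including socket.timeout) if the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open("123.123.123.123", 25)` would test the placeholder address from the log message; the port number here is a guess and should be replaced with the node's real port.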

 

I'm new to Stingray load balancers, having only worked with F5 in the past.  Is this normal?  Do I really need to manually intervene to fix this?  Am I missing some obvious config?

 

One attached screenshot is the health monitor config (listed in the catalog as Connect, I'm not sure if that is a standard monitor or was created by my predecessor).

 

The pool also has passive monitoring turned on.

 

The second screenshot shows the connection management settings:

 

Is something in those configs causing the LB to give up on my nodes after some outage and require me to manually intervene?

 

1 ACCEPTED SOLUTION

Accepted Solutions
mbodding
Occasional Contributor

Re: pool won't reconnect to failed node after it recovers

Hi Jasbro,

 

It looks like the passive monitor did indeed fail your nodes. In that case the passive monitor needs to recover them; a working active monitor will not recover a node that was failed by the passive checks. The most common reasons for a node not recovering from a passive failure are:

 

 1. You don't have any traffic, or

 2. Your traffic is all non-idempotent (e.g. POSTs)

 

If your traffic is largely POSTs, then I would suggest disabling passive monitoring, because without any idempotent requests the passive monitor will never be able to recover the failed nodes.

 

Cheers,

Mark


4 REPLIES

jasbro
New Contributor

Re: pool won't reconnect to failed node after it recovers

That makes sense. In this case it is a pool of SMTP servers in our warm-standby DR environment, so there is very little traffic there (a few messages from a daily cron job, really). Sounds like I need to disable the passive monitoring.
5Lights
New Contributor

Re: pool won't reconnect to failed node after it recovers

I've had this issue too and disabling/re-enabling a node tends to fix this.

 

Why does Passive Monitoring only recover nodes when it sees successful idempotent requests?

Surely a successful POST request that returns a 200 response in a timely manner should recover the node from a Passive Monitoring failure?

 

Neither this article, https://community.pulsesecure.net/t5/Pulse-Secure-vADC/Feature-Brief-Health-Monitoring-in-Stingray-T..., nor the Help guide within the VTM mentions this.

mbodding
Occasional Contributor

Re: pool won't reconnect to failed node after it recovers

Hi 5Lights,

 

It seems I was mistaken. I've long believed that non-idempotent requests would not be tried against failed nodes, but it's been pointed out to me today that that is simply not the case.

 

We will try POSTs against failed nodes.

 

I've looked back through the release notes and dug out an ancient copy of ZXTM (5.1r3), and it seems it's been that way for a long time, possibly forever. 

 

[Screenshot: image.png]

 

So the only reason your traffic wouldn't recover the node is if you have no traffic, or if session persistence means the node is avoided for other reasons. Apologies for the confusion.
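To illustrate why a node failed by passive checks stays failed without traffic, here is a toy model in Python. It is not the traffic manager's actual logic: the names `ToyNode` and `handle_request` are hypothetical, and it assumes that the very next request is tried against the failed node, whereas a real traffic manager decides for itself when to retry a failed node.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    name: str
    healthy: bool = True         # real state of the back-end server
    marked_failed: bool = False  # what the (toy) passive monitor believes


def handle_request(node: ToyNode) -> bool:
    """Send one request to `node` and let the outcome update its passive state.

    Passive monitoring has no probes of its own: the state only changes
    when real traffic is sent to the node.
    """
    ok = node.healthy
    node.marked_failed = not ok
    return ok


node = ToyNode("smtp1")

node.healthy = False          # the server goes down overnight
handle_request(node)          # one message fails, so the node is marked failed

node.healthy = True           # the server comes back
# No traffic arrives, so nothing re-evaluates the node.
still_failed_without_traffic = node.marked_failed

handle_request(node)          # the next request succeeds and recovers the node
recovered_after_traffic = not node.marked_failed
```

In this model the node stays marked as failed for as long as no request reaches it, which matches the low-traffic warm-standby case described earlier in the thread.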

 

Cheers,

Mark