I am hoping someone in this community can give us a hand in figuring out why our cluster users keep swapping between the 2 SA cluster members. We have 1 SA 6500 in NY and 1 SA 6500 in NJ in an Active/Active cluster. The boxes are never down and there are never any communication issues between them, yet we see many, many instances where users swap between box A and box B (and vice versa) and we cannot figure out why.
The reason this is a big issue for us is that we are seeing a ton of Host Checker timeout failures. This is not due to inactivity or maximum session length, and the HC failure gives no indication of why it is timing out. After reviewing all the logs and Wireshark captures we sent, JTAC came back saying that we needed persistence on our load balancer. So we implemented persistence. That made no difference. They then asked us to raise the persistence inactivity timeout from 30 minutes to 12 hours, and that has had no noticeable effect either. JTAC believes that because users are bouncing between the 2 SA boxes, Host Checker does not get a proper response within the 60-minute check interval we have enabled and therefore drops the user.
Note that this swapping of users from box to box also happens in our UK cluster, which is Active/Active with 2 SA 6500 devices sitting next to each other, but absolutely no users are ever sent to the B box. The exact same thing happens in our Singapore environment with 2 SA 4500 boxes sitting next to each other in an Active/Active cluster, but again absolutely no users are ever sent to the secondary box. Yet every day we look at the logs and see that users were on the B boxes.
Although JTAC wants to continue to blame the load balancer, we know for sure that is not the case. We think that something is happening on the SA devices (maybe due to a bad configuration?) that is making these users swap back and forth between the boxes.
We have escalated this to the highest level of JTAC via our Sales rep and it has not been rectified. Our end users are screaming as they continue to be dropped "for no reason".
Anyone know what could cause this?
Hello,
I understand your concern, and I am sorry to hear what is happening with your cluster and the impact on the user experience.
In an effort to help, I have a few questions and something you may want to try to see if it resolves your issue.
Questions:
- When did this issue start?
- Have there been any upgrades to the IVE OS, or any changes on the network or the load balancer?
- Which load balancer are you using, and have you involved your load balancer team? If so, what do they say?
- Do you see any logs on the load balancer regarding this?
- Is there any time pattern to the issue? (Peak hours / non-peak hours / anytime / always)
- Is there any reason why JTAC feels it is a load balancer issue? (Any specific logs pointing to it?)
I was trying to recall whether I had ever heard of this before, and yes, I think I remember seeing a scenario like this.
The solution we tried there came purely from going through the logs from the device, with help from development. I would like to share with you what we did. However, please be aware that this solution may or may not apply to your situation - give it a try if you would like to.
- The SA6500 has an SSL acceleration chip on the device.
- Disable the acceleration and reboot the machine. Rebooting is essential for the changes to take effect.
Impact of disabling SSL acceleration:
- The impact of SSL acceleration on performance varies based on the encryption cipher being used.
- Standard browsers (when going through the rewriter) usually use RC4 encryption. The IVE does this encryption in the main CPU itself (as it is quite lightweight), so disabling the accelerator card will not affect performance in such cases.
- However, if you are using a heavier cipher (3DES / AES) and running under load, the SSL accelerator greatly offloads the expensive bulk encryption work from the main CPU, so the system can be loaded to greater limits without adversely impacting performance. Typically we have seen that for 3DES encryption, disabling the accelerator card causes the IVE to take about a 30% performance hit, especially if you are doing end-to-end encryption (front-end and backend SSL).
In summary:
Disabling the SSL hardware accelerator does not affect SA6500 performance (throughput) at the regular packet size of 512 bytes; for small packet sizes there is a slight degradation.
Disabling the SSL hardware accelerator causes roughly a 30% degradation on the SA6000, and the degradation becomes more acute as packet sizes get larger.
NOTE:
Please "DO NOT" disable the SSL hard acceleration if you have many NC (Network Connect) users. Since our NC (Network Connect) client uses 3DES/AES by default, by disabling acceleration, there could be a noticeable performance impact on the system. You may not see a process restart per-se, but packet loss will be greater (for NC) and the page response time will be slower (for core).
Please let me know if the above helps.
Thanks,
Thanks for your response. Below are answers to your questions and some more detail. I am very eager to get this resolved and will provide any detail you require. Note that this has been escalated to very high-level JTAC engineers; each engineer has put a lot of time and effort into this, but unfortunately the issue is still very prevalent. It is not due to a lack of effort from Juniper.
- When did this issue start?
- Have there been any upgrades to the IVE OS, or any changes on the network or the load balancer?
- Which load balancer are you using, and have you involved your load balancer team? If so, what do they say?
- Do you see any logs on the load balancer regarding this?
- Is there any time pattern to the issue? (Peak hours / non-peak hours / anytime / always)
- Is there any reason why JTAC feels it is a load balancer issue? (Any specific logs pointing to it?)
PLEASE NOTE - We have SA 6500s in the US and UK and SA 4500s in Singapore, and in all cases nearly all users are using Network Connect, so we cannot turn off acceleration per your note: <<<Please DO NOT disable SSL hardware acceleration if you have many NC (Network Connect) users. Since the NC client uses 3DES/AES by default, disabling acceleration could have a noticeable performance impact on the system. You may not see a process restart per se, but packet loss will be greater (for NC) and page response time will be slower (for core).>>>
HOWEVER: Either late last year or early this year we were getting complaints of slow response, and JTAC suggested that we turn OFF hardware acceleration. So if the points in your post are accurate, could that be why we are still getting slow response reports from our end users? If all of our users are using Network Connect and we have hardware acceleration disabled, are we adding to the slow response misery? See the JTAC response about this below:
"It has to do with the SSL Acceleration option. This feature is enabled by default on all SA6500 devices. It is used to offload SSL operations from the main CPU."
Due to some issues seen in the past with enabling the feature, the vendor recommends disabling it. Below is a statement we received from the vendor:
We are recommending that it be disabled for two reasons: first, out of an abundance of caution, and second, it will have no impact on the functionality of the box. You do not require it for the box to work, but you will potentially see a CPU hit of 20-30% as the device approaches maximum user load. The PR for the potential SSL issue is 562746.
The secondary devices in the UK and Singapore are not used in any way, shape, or form via our network, load balancer, etc., yet we see users bouncing over to them in the Juniper logs.
-> You see users bouncing in the logs; did you take a network capture on these "inactive" devices to see if there is real user traffic?
We have SA 6500s in the US and UK and SA 4500s in Singapore, and in all cases nearly all users are using Network Connect, so we cannot turn off acceleration per your note.
-> Does DNS resolution of the SA cluster give the same result before and after the NC tunnel is established?
Hello GiantYank,
It may be a little difficult to troubleshoot this issue without viewing the logs. Can you provide the case number? I will try to follow up internally and see what I can do to help. Thanks
Niol - Thanks for the response.
Kita - The ticket opened with JTAC is 2011-1026-0834. There were several opened by our vendor with Juniper, and I believe this one has the latest and greatest detail. It is getting very frustrating: after all of this time, all of the conference calls, the testing with the load balancers, the escalation to F5 to ensure that the load balancers are perfectly configured and working properly, and all of the emails around this, we received yet another email from the JTAC engineer working this case blaming the load balancers again. See below, as it has some detail that may be helpful to anyone following these posts. I removed names for privacy and security reasons.
I know you are new to this case, but we have beaten the "persistence setting on the load balancer" topic to death already. It has been set to persistence for over a month now, and JTAC requested that the inactivity timeout for the persistence on the load balancer be set to 12 hours instead of the 30 minutes we had it set to previously. That was implemented on Tuesday. Persistence is absolutely, positively, without a doubt, enabled. See the screenshot below and note that there are 2,774 persistence records for ("URL") RIGHT now ($ grep "pool \"("URL")" ldns | wc -l returns 2774).
Furthermore, as has been stated several times already, the (other region ive) & (other region ive) Singapore are experiencing the exact same issue, and they are set to 100/0 with persistence and an inactivity timeout of 12 hours, and the failover is the (main region ive) (not the secondary device in those locations). The load balancer has absolutely no knowledge of the secondary boxes in the (other region ive) and (other region ive). Yet, somehow, someway, the users are bouncing back and forth between the primary and secondary IVEs in the (other region ive) and (other region ive) per the logs as well. This is NOT a load balancer persistence issue. I agree with xxxxxx that we need to have yet another call on this, as we keep going around in circles.
I've worked with Juniper's ssl vpn and F5's load balancers for years and have seen similar things.
One thing to keep in mind is that there is a difference between F5 persistence and timeout.
Also, although an F5 may not be aware of a device, if the F5 is set up at L3 then all traffic passes through the F5, and as such its timeouts apply. The default F5 idle timeout is 5 minutes, though we change that to 30 to match our firewall timeouts.
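To make the persistence-vs-timeout distinction concrete, here is a minimal sketch of a source-address persistence override set from an iRule; the netmask and the 12-hour value are just illustrations (43200 mirrors what JTAC asked for), and it assumes a source-address persistence profile can be applied to the virtual server, so treat it as an example rather than a recommendation for your exact setup:
when CLIENT_ACCEPTED {
    # Pin this client to whichever SA it first lands on, keyed by source address,
    # for 12 hours (43200 seconds). This persistence timer is tracked separately
    # from the TCP/fastL4 idle timeout that IP::idle_timeout (further down) controls.
    persist source_addr 255.255.255.255 43200
}
The point is that the persistence record timer and the connection idle timer run independently, so raising one without the other can still leave room for mid-session re-balancing.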
Ask your F5 guys to watch a particular user via "b conn client x.x.x.x server x.x.x.x show all". Look at the last line of output for "IDLE".
My guess is that you will be able to fix this by changing F5 profile timeout values. Have the F5 guys be as granular as possible here so that they don't change global profiles like tcp or fastL4, because that will affect the entire F5. The best case would be to attach an iRule to the F5 VIP for a 12-hour timeout:
when SERVER_CONNECTED {
    # Override the idle timeout for this connection to 12 hours (43200 seconds)
    IP::idle_timeout 43200
}
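If it would help to pin down when a given user actually moves, a small logging iRule on the same VIP can record every load-balancing decision the F5 makes. This is just a sketch of the kind of thing I have used for triage, not anything from your case, and it assumes logging to the local0 facility is acceptable on your boxes:
when LB_SELECTED {
    # Log which pool member (i.e. which SA) the F5 chose for this client connection.
    log local0. "client [IP::client_addr] -> pool member [LB::server addr]:[LB::server port]"
}
Correlating those log timestamps with the SA user-access logs should settle whether the F5 is really re-balancing existing users or whether the bouncing only ever shows up on the Juniper side.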
Thanks - our firewall team, who support the F5s, worked with F5 support to be sure that we are configured properly and to review the logs in great detail. It is set for persistence and the inactivity timeout is at 12 hours (43,200 seconds), as you suggest; it had been set to 30 minutes before. It didn't make the slightest difference.