
Critical problem with SSL VPN cluster with users switching between SA devices

GiantYank_
Occasional Contributor

Critical problem with SSL VPN cluster with users switching between SA devices

I am hoping someone in this community can give us a hand in figuring out why our cluster users keep swapping between the 2 SA cluster members. We have 1 SA 6500 in NY and 1 SA 6500 in NJ in an Active/Active cluster. The boxes are never down and there are never any communication issues between them. We see many, many instances where users swap between box A and box B (and vice versa) and we cannot figure out why.

The reason this is a big issue for us is that we are seeing a ton of Host Checker timeout failures. These are not due to inactivity or maximum session length, and the HC failure gives no indication of why it is timing out. In all the logs and Wireshark captures sent to JTAC, they came back saying that we needed persistence on our load balancer. So we implemented persistence. That made no difference. They then asked us to set the persistence inactivity timeout to 12 hours (it was 30 minutes), and that has had no noticeable effect either. JTAC believes that because they see the users bouncing between the 2 SA boxes, Host Checker does not get a proper response during the 60-minute check we have enabled and therefore HC drops the user.

Note that this swapping of users from box to box also happens in our UK cluster, which is Active/Active with 2 SA 6500 devices sitting next to each other, even though absolutely no users are ever sent to the B box. The exact same thing happens in our Singapore environment with 2 SA 4500 boxes sitting next to each other in an Active/Active cluster, where again absolutely no users are ever sent to the secondary box. Yet every day we look at the logs and see that users were on the B boxes.

Although JTAC wants to continue to blame the load balancer, we know for sure that is not the case. We think something is happening on the SA devices (maybe due to a bad configuration?) that is making these users swap back and forth between the boxes.

We have escalated this to the highest level of JTAC via our Sales rep and it has not been rectified. Our end users are screaming as they continue to be dropped "for no reason".

Anyone know what could cause this?

11 REPLIES
AJA_
Frequent Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

Hello,

I understand your concern and I am very sorry to hear about what is happening on your cluster and the impact on the user experience.

In an effort to help, I have a few questions and something that you may want to try to see if it resolves your issue.

Questions:

- When did this issue start?

- Any upgrade on the IVE OS / Any changes on the network / Any changes on the load balancer?

- What load balancer are you using, and have you involved your load balancer team in this? If yes, what do they say?

- Do you see any logs on the load balancer regarding this?

- Is there any time pattern to when you see this issue? (Peak hours / Non-peak hours / Anytime / Always)

- Is there any reason why JTAC feels it's your load balancer issue? (Any specific logs pointing to it?)

I was wondering whether I had heard of something like this before, and yes, I think I remember seeing a similar scenario.

The solution we tried there came purely from going through the logs from the device, with help from development. I would like to share what we did in that case. However, please be aware that it may or may not be similar to your situation - give it a try if you think you would like to.

- The SA6500 has an SSL acceleration chip on the device.

- Disable the acceleration and reboot the machine. Rebooting is essential for the change to take effect.

Impact of disabling SSL acceleration:


- The impact of SSL acceleration on performance varies based on the encryption cipher being used.
- Standard browsers (when doing rewriter) usually use RC4 encryption. The IVE does this encryption in the main CPU itself (as it is quite lightweight). Disabling the accelerator card will not affect performance in such cases.
- However, if you are using a heavier cipher (3DES / AES) and running under load, then the SSL accelerator greatly offloads the expensive bulk encryption operation from the main CPU. Thus, one can load the system to greater limits without adversely impacting performance. Typically we have seen that for 3DES encryption, disabling the accelerator card causes the IVE to take about a 30% performance hit, especially if one is doing end-to-end encryption (front-end and back-end SSL).

In summary:

Disabling the SSL hardware accelerator does not affect SA6500 performance (throughput) at the regular packet size of 512 bytes; for small packet sizes there is a slight degradation.

Disabling the SSL hardware accelerator causes about a 30% degradation on the SA6000, and the degradation becomes more acute as packet sizes get larger.

NOTE:

Please "DO NOT" disable the SSL hard acceleration if you have many NC (Network Connect) users. Since our NC (Network Connect) client uses 3DES/AES by default, by disabling acceleration, there could be a noticeable performance impact on the system. You may not see a process restart per-se, but packet loss will be greater (for NC) and the page response time will be slower (for core).

Please let me know if the above helps.

Thanks,

GiantYank_
Occasional Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

Thanks for your response. Below are answers to your questions and some more detail. I am very eager to get this resolved and will provide any detail you require. Note that this has been escalated to very high-level JTAC engineers - each JTAC engineer has put a lot of time and effort into this, but unfortunately the issue is still very prevalent. It is not due to a lack of effort from Juniper.

- When did this issue start?

  • To the best of our knowledge it has ALWAYS been this way. We are only now getting a great deal of complaints because users who had been migrated from our old IPsec platform were really not using the new platform until we disabled their old accounts. So now all users, no matter what, are using the Junipers and the complaints are growing daily.


- Any upgrade on the IVE OS / Any changes on the network / Any changes on the load balancer?

  • No changes to the IVE - we have been running 7.0R5.1 for many months now. We cannot upgrade any higher as it wreaks havoc with the "logoff on connect" feature that about 50% of our users use. When we upgrade, the new version of NC does not keep the setting for "logoff on connect", so all users that have it enabled have to re-enable it. We can't push a policy for "logoff on connect" as only 50% use it and we can't hit everyone... Note that I escalated this with JTAC and they hope to have this issue fixed in the next release as well.
  • No changes at all to network.
  • The only change to the load balancer was to enable persistence - which per the Juniper Admin Guide we should have always had on! - and then recently we upped the load balancer inactivity timer from 30 minutes to 12 hours.


- What load balancer are you using, and have you involved your load balancer team in this? If yes, what do they say?

  • We are using F5 load balancers. The F5 team is fully on top of this and sees absolutely no issue with the F5 deployment, and we use these F5 devices for a ton of load balancing for many other applications with no issues.


- Do you see any logs on the load balancer regarding this?

  • We do not have F5 logs that show exactly what is happening, and we would have to set up a sniffer trace to understand if there is truly an issue. But it is very important to understand that we have no issues load balancing any other servers/apps. More importantly, our UK and Singapore solutions are set to 100/0 on the load balancer, where if the primary goes down in UK or Singapore the traffic is sent to the United States - not to the secondary in the UK or Singapore cluster. We see absolutely no evidence of users in UK or Singapore bouncing back and forth between UK/Singapore and the US. That speaks volumes here, as it really points us to the Juniper devices/clusters. The secondary devices in UK and Singapore are not used in any way, shape, or form via our network, load balancer, etc., yet we see users bouncing over to them in the Juniper logs.
  • Also, the load balancer logs show absolutely no instances of any of the monitored devices going down. So the LB is not moving anyone to another IP.


- Is there any time pattern to when you see this issue? (Peak hours / Non-peak hours / Anytime / Always)

  • Always


- Is there any reason why JTAC feels it's your load balancer issue? (Any specific logs pointing to it?)

  • I'm going to try to sum up what JTAC is saying and I hope I capture it 100% properly... They say that although state tables are shared within a cluster, Juniper has no mechanism for moving users back and forth between the cluster members unless one of the members goes down. <<None of the boxes are ever down, even momentarily, per monitoring tools, event logs, etc.>> So JTAC keeps thinking it is our load balancer even though we supply evidence that it cannot be the load balancer. <<For example - sorry to repeat - the LB is set to 100/0 with persistence and inactivity at 12 hours in UK and Singapore, the LB does not even know about the IP addresses of the secondary devices out there, and failover is to the US. So the LB in this situation really can't be involved.>>
  • JTAC keeps thinking that the user - on initial sign-in - resolves the name to one member IP and then somehow starts resolving to the other member IP.
  • They also said that when Host Checker performs its 60-minute interval scan to ensure AV is running and the user is still aligned with security policy, it reaches out to the user, and the response could be coming back to the other cluster member; the member the user is on therefore thinks it did not get a response and blows the user away. Because they see the users bouncing back and forth between boxes, they think this is the cause of the drops.
  • However, the LB now keeps the user going to the same IP for 12 hours with persistence, and the maximum session length set in the IVE is 12 hours, so during the life of the session the user will resolve to the same IP.
  • On top of that, when a user connects with Network Connect, their hosts file is configured by Juniper to use the IP address of the IVE that they connected to. So really, once a user is connected, they should always resolve via their hosts file to the correct IP.

PLEASE NOTE - We have SA 6500s in US and UK and SA 4500s in Singapore and in all cases, nearly all users are using Network Connect so we cannot turn off acceleration per your detail. <<<Please "DO NOT" disable SSL hardware acceleration if you have many NC (Network Connect) users. Since the NC (Network Connect) client uses 3DES/AES by default, disabling acceleration could have a noticeable performance impact on the system. You may not see a process restart per se, but packet loss will be greater (for NC) and page response time will be slower (for core).>>>

HOWEVER: Either late last year or early this year we were getting complaints of slow response, and JTAC suggested that we turn OFF hardware acceleration. So if the points in your post are accurate, could that be why we are still getting slow response reports from our end users? If all of our users are using Network Connect and we have hardware acceleration disabled, are we adding to the slow response misery? See the JTAC response about this below:

It has to do with the SSL Acceleration option. This feature is enabled by default on all SA6500 devices. It is used to offload SSL operations from the main CPU.

Due to some issues seen in the past with enabling the feature, the vendor recommends disabling it. Below is a statement we received from the vendor:

We are recommending that it be disabled for two reasons: first, out of an abundance of caution, and second, because it will have no impact on the functionality of the box - you do not require it for the box to work - although you will potentially see a CPU hit of 20-30% as the device approaches maximum user load. The PR for the potential SSL issue is 562746.

Niol_
Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

The secondary devices in UK and Singapore are not used in any way, shape, or form via our network, load balancer, etc., yet we see users bouncing over to them in the Juniper logs.

-> You see users bouncing in the logs; did you take a network capture on these "inactive" devices to see if there is real user traffic?

We have SA 6500s in US and UK and SA 4500s in Singapore and in all cases, nearly all users are using Network Connect so we cannot turn off acceleration per your detail.

-> Does DNS resolution of the SA cluster give the same result before and after the NC tunnel is established?
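For example (vpn.example.com is just a placeholder for your sign-in FQDN), you could compare what a Windows client resolves before launching Network Connect and again once the tunnel is up, and also check whether NC has pinned an entry in the hosts file:

nslookup vpn.example.com   (queries DNS directly; run it before and after the tunnel comes up)
ping -n 1 vpn.example.com   (uses the full Windows resolver, so it also reflects any hosts-file entry)
type C:\Windows\System32\drivers\etc\hosts   (shows whether Network Connect added or changed an entry)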

Kita_
Valued Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

Hello GiantYank,

It may be a little difficult to troubleshoot this issue without viewing the logs. Can you provide me the case number? I will try to follow up internally and see what I can do to help. Thanks

GiantYank_
Occasional Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

Niol - Thanks for the response.

  • JTAC has asked for tcpdumps, user logs, event logs, user traces, etc. from the IVEs, and our vendor who manages the devices sent everything to JTAC for review. I'm not sure if they specifically looked at the tcpdumps to verify the swapping of users, but JTAC acknowledges that they see the issue. I'm going to ask the vendor to send me the tcpdumps and see if I can track it. Note that JTAC also asked for and received many Wireshark captures and client-side logs from several end users as well.
  • The same load balancers resolve the URLs for users internally and externally. We have tested extensively and also escalated to F5 to ensure that this is not an F5 issue. In all testing we absolutely resolve to the same IP we originally resolved to, both internally and externally.
GiantYank_
Occasional Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

Kita - The ticket opened with JTAC is 2011-1026-0834. There were several opened by our vendor to Juniper, and I believe this one has the latest and greatest detail. It is getting very frustrating: after all of this time and all of the conference calls, testing with the load balancers, escalation to F5 to ensure that the load balancers are configured and working perfectly, and emails around this, we received another email from the JTAC engineer working this, blaming it on the load balancers again. See below, as it has some detail that may be helpful to anyone tracking these posts. I removed names for privacy and security reasons.

I know you are new to this case but we have beaten the "persistence setting on the load balancer" topic to death already. Persistence has been enabled for over a month now, and JTAC requested that the inactivity timeout for the persistence on the load balancer be set to 12 hours instead of the 30 minutes that we had it set to previously. That was implemented on Tuesday. Persistence is absolutely, positively, without a doubt, enabled. See the screenshot below and note that there are 2,774 persistence records for ("URL") RIGHT now... ($ grep "pool \"("URL")" ldns | wc -l >> 2774).

Furthermore, as has been stated several times already, the (other region ive) & (other region ive) Singapore are experiencing the exact same issue, and they are set to 100/0 with persistence and the inactivity timeout set to 12 hours, and the failover is to the (main region ive) (not the secondary device in those locations). The load balancer has absolutely no knowledge of the secondary boxes in the (other region ive) and (other region ive). Yet, somehow, someway, the users are bouncing back and forth between the primary and secondary IVEs in the (other region ive) and (other region ive) per the logs as well. This is NOT a load balancer persistence issue. I agree with xxxxxx - we need to have yet another call on this as we keep going around in circles.

zanyterp_
Respected Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

Have you had time to work with any of the users that you consistently see bounce between nodes to set a static hosts entry? I know you have many users on this and it may not be possible, but that will absolutely take the load balancer out of the equation.
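For example (the IP address and hostname below are placeholders for one SA node's address and your sign-in FQDN), the entry in C:\Windows\System32\drivers\etc\hosts on a Windows client would look something like:

10.1.2.3    vpn.example.com

With that in place the client always resolves the sign-in URL to that single node no matter what DNS or the F5 returns, so if those users still show up on the other box in the logs, the cause has to be on the SA side.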
jjh_
Occasional Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

I've worked with Juniper's SSL VPN and F5's load balancers for years and have seen similar things.

One thing to keep in mind is that there is a difference between F5 persistence and timeout.

Also, although an F5 may not be aware of a device, if the F5 is set up at L3 then all traffic passes through the F5 and as such its timeouts apply. The default F5 idle timeout is 5 minutes, though we change that to 30 minutes to match our firewall timeouts.

Ask your F5 guys to watch a particular user via "b conn client x.x.x.x server x.x.x.x show all". Look at the last line of output for "IDLE".

My guess is that you will be able to fix this by changing F5 profile timeout values. Have the F5 guys be as granular here as possible so that they don't change global profiles like tcp or fastL4, because that would affect the entire F5. The best case would be to attach an iRule to the F5 VIP for a 12-hour timeout:

when SERVER_CONNECTED {
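# override this connection's idle timeout to 43200 seconds (12 hours)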
IP::idle_timeout 43200
}

GiantYank_
Occasional Contributor

Re: Critical problem with SSL VPN cluster with users switching between SA devices

Thanks - our firewall team - who supports the F5s - worked with F5 support to be sure that we are configured properly and to review the logs in great detail. It is set for persistence and the inactivity is at 12 hours (43,200 seconds) as you suggest. It had been set to 30 minutes before. It didn't make the slightest difference.