I am currently using two SA2000 units in an active/passive failover cluster. The units work wonderfully until I try to fail the cluster over to the secondary unit. Firmware is 7.0R4 (build 17289).
Once the second unit takes over, any NC clients that connect cannot access any of our subnets bar one. Both concentrators are plugged into the same switch. I have triple-checked all the settings but cannot find any problems that I know of.
I cannot easily test and play with the second unit because it obviously affects my clients quite badly, and if I take it out of the cluster it has no Network Connect license for standalone use.
Where would be the best place to start looking to track down this problem?
One place you could start is running tcpdump on both IVEs. You could first verify that the external packets are reaching the active IVE, and then verify that the internal request and return packets are being sourced and received on the active IVE as well.
The problem might be related to stale MAC entries for the internally assigned NC IPs on the directly connected internal switch once a failover is triggered.
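To illustrate the stale-entry idea, here is a rough Python sketch. It assumes you have already collected IP-to-MAC mappings for the NC pool addresses after a failover (for example by copying them out of the ARP table on the switch or an upstream router) and that you know the internal-port MAC of the node that is now active. The helper names, addresses, and MACs are all made up for illustration; this is not a Juniper or Cisco tool.

```python
import ipaddress
import re


def normalise_mac(mac):
    """Reduce a MAC to 12 lowercase hex digits so that
    00:11:22:AA:BB:02 and 0011.22aa.bb02 compare equal."""
    return re.sub(r"[^0-9a-f]", "", mac.lower())


def find_stale_entries(arp_after_failover, active_node_mac):
    """Return the NC IPs whose ARP entry does not point at the
    MAC of the node that is active after the failover.

    arp_after_failover: dict of IP string -> MAC string (hypothetical
    data gathered by hand after the failover).
    """
    wanted = normalise_mac(active_node_mac)
    stale = [
        ip
        for ip, mac in arp_after_failover.items()
        if normalise_mac(mac) != wanted
    ]
    # Sort numerically by address so the report is easy to read.
    return sorted(stale, key=ipaddress.ip_address)


# Made-up example: .10 still maps to the old node's MAC.
arp_table = {
    "10.0.5.11": "0011.22aa.bb02",
    "10.0.5.10": "00:11:22:AA:BB:01",
}
print(find_stale_entries(arp_table, "00:11:22:aa:bb:02"))
```

Any address this reports is one the internal network is still sending to the old node, which would fit the symptom of return traffic never reaching the client.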
The biggest problem is that I cannot fail over the cluster currently due to the effect on the users which makes testing extremely hard, so I am thinking of getting a temporary license from Juniper support for the second unit and performing some dumps etc.
The directly connected switch is a Cisco 2960/12.2(40), so it could possibly be an ARP-related issue, but as above I think I should probably just open a ticket so I can troubleshoot the issue without consequence.
Did you try connecting directly to the passive unit? This will help isolate whether the issue occurs only during failover or also when users start new sessions on the passive unit. If it's the latter case, you will be able to troubleshoot without having to fail over.
The problem with doing that is that I cannot connect directly to it, as once it is taken out of the cluster it has no NC license. And I don't see how I can connect to the passive unit while it is in the cluster, since it isn't active.
You can connect directly to the passive node while it is in the cluster by connecting to the IP address of the node.
Do you know whether the route on the internal network points to the cluster VIP (good) or to a specific node's internal port IP (not good)?
The internal network routes all point to the cluster VIP correctly, not to either of the nodes, so I do not think this is the issue. I will try connecting to the node IP and see how it goes.
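As a sketch of that check in Python, assuming you have copied the relevant static routes off your internal router by hand as (prefix, next hop) pairs: the function flags any route covering the NC address pool whose next hop is one of the individual node IPs instead of the VIP. The function name and every address below are hypothetical.

```python
import ipaddress


def routes_pointing_at_node(routes, nc_pool, node_ips):
    """Return the (prefix, next_hop) routes that cover the NC pool
    but use a node's own internal IP as the next hop.

    routes:   list of (prefix, next_hop) strings, entered by hand
    nc_pool:  the NC client address pool, e.g. "10.0.5.0/24"
    node_ips: set of the two nodes' internal port IPs (not the VIP)
    """
    pool = ipaddress.ip_network(nc_pool)
    bad = []
    for prefix, next_hop in routes:
        covers_pool = ipaddress.ip_network(prefix).overlaps(pool)
        if covers_pool and next_hop in node_ips:
            bad.append((prefix, next_hop))
    return bad


# Made-up example: VIP is 192.168.1.10, nodes are .11 and .12.
routes = [
    ("10.0.5.0/24", "192.168.1.11"),  # points at node A: breaks on failover
    ("10.0.6.0/24", "192.168.1.10"),  # unrelated subnet via the VIP
    ("0.0.0.0/0", "192.168.1.1"),     # default route, not a node
]
print(routes_pointing_at_node(routes, "10.0.5.0/24",
                              {"192.168.1.11", "192.168.1.12"}))
```

An empty result means nothing in the list sends NC return traffic to a single node, so the routes themselves can be ruled out.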
I've now had time to get around to connecting to the second concentrator and running some quick packet dumps at every stage of the network. I have confirmed the following.
Any ideas appreciated.
What happens if you exit Network Connect on the client machine after the failover and restart it?
Also, it might be interesting to force NCP instead of ESP and see if the same behavior occurs. It sounds from your description of your test that the packets are getting routed correctly after the failover, but that the backup machine fails to send it to the PC. This makes me wonder if there is some sort of key mismatch on the outbound ESP traffic.
Exiting NC after failover does not help, sadly. I have tried that a few times in the past.
As for NCP and ESP transport modes, they don't seem to have any effect on failover or on NC working on the second concentrator. I think I am now at the point where I should probably just open a ticket and see what Juniper support think directly.
Will update this thread on how that works out.