We have a pair of SA 6000s running in an Active-Passive cluster on version 6.1R2-1 (build 13103). A few months ago we had a problem where both boxes would suddenly crash. There was nothing in the logs, and the boxes had to be powered off and back on to restart them. We opened a ticket with Juniper and set up remote debugging, packet sniffers, etc., but after 3 weeks the problem had not recurred. Today the problem has just happened again. Has anyone seen anything like this? Any ideas?
We are running mainly Network Connect, with some WSAM and file and web access. We also use the Support Meeting feature.
I have seen issues when a cluster member loses contact with its peer. In my case, it locks up the Network Connect users and will not hand out IP addresses any more. Remote access comes to a halt, but admin access to the server still works.
If the cluster members are located in the same DC, make sure you have good communications between them. If they are remote, make sure the latency between them is not more than 100 ms.
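If you want a rough number for that latency, here is a minimal sketch in Python that times a TCP handshake and compares the median against the 100 ms figure above. It is only an approximation: it runs from whatever host you launch it on, not from the appliances themselves, and the function names and the default budget are my own choices for illustration, not anything from Juniper.

```python
import socket
import statistics
import time


def tcp_rtt_ms(host, port, timeout=2.0):
    """Time one TCP handshake to host:port and return it in milliseconds.

    A completed handshake takes roughly one network round trip, so this
    is a rough stand-in for latency when ICMP ping is not available.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection is closed again as soon as it is established
    return (time.perf_counter() - start) * 1000.0


def within_budget(samples_ms, budget_ms=100.0):
    """Return True if the median of the samples is at or below the budget.

    The 100 ms default is the figure quoted in this thread (assumption:
    it is being treated here as a limit on measured handshake time).
    """
    return statistics.median(samples_ms) <= budget_ms
```

To use it, collect a handful of samples from a machine near one node against a reachable TCP port on the other (for example, ten calls to tcp_rtt_ms with the peer's address and port 443), then pass the list to within_budget. Using the median keeps one slow sample from skewing the result.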
When our issue occurs, both appliances are totally dead, not even responding on the console port. The last messages in the log are one appliance seeing the other go down and taking over, then that one immediately dies.
We've since tried running with one appliance shut down and had the same problem, so it looks like it is not actually related to the clustering. Alas, once again, as soon as we get set up to capture logs and debugging, we don't get the issue.
We just encountered this exact same issue with our SA-6000 units. They're running 6.3R3 at this time and have been performing fine for years on this and other versions of IVE OS. Then, last night, they fell over so hard they wouldn't respond even to console access. No user page, no admin page, no ARP for the individual units or the cluster VIP in our core's ARP tables. We brought them back with a manual power cycle.
Then they fell over again this morning. More worrisome, my redundant unit in the DR datacenter (not part of this cluster, runs the same IVE OS) fell over this morning at the same time as the clustered units in my primary facility.
JTAC has had us drop the passive member and rebuild the cluster, but has no explanation for this event.