SBD killing both cluster nodes even on minor SAN network problems
- by Wieslaw Herr
I am having problems with SBD-based STONITH in an openais-based cluster.
Some background:
The active/passive cluster has two nodes, node1 and node2, configured to provide an NFS service to users. To avoid split-brain situations, both nodes are configured to use SBD. SBD uses two 1 MB disks that are presented to the hosts over a multipathed Fibre Channel SAN.
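Roughly, the SBD part of the setup looks like this (the device names below are placeholders, not the real ones):

  # initialise both 1 MB devices once, from one node
  sbd -d /dev/mapper/sbd_disk1 create
  sbd -d /dev/mapper/sbd_disk2 create

  # /etc/sysconfig/sbd on both nodes
  SBD_DEVICE="/dev/mapper/sbd_disk1;/dev/mapper/sbd_disk2"
  SBD_OPTS="-W"

plus a STONITH resource along the lines of "primitive stonith_sbd stonith:external/sbd" in the CIB.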
The problems start whenever something happens to the SAN. For example, today one of the Brocade switches was rebooted and both nodes lost 2 out of 4 paths to each disk, which resulted in both nodes committing suicide and rebooting. This was, of course, highly undesirable, because a) there were still working paths left, and b) even if the switch had been down for only 10-20 seconds, a reboot cycle of both nodes takes 5-10 minutes and all NFS locks are lost.
I tried increasing the SBD timeout values (to 10 s and above; dump attached at the end), but a "WARN: Latency: No liveness for 4 s exceeds threshold of 3 s" in the logs hints that something isn't working the way I expect it to.
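Concretely, I recreated the device headers with something like the following (the values are only the ballpark I used, and the device names are again placeholders), then restarted the cluster stack on both nodes:

  # recreate the headers with longer timeouts: -1 watchdog, -4 msgwait
  sbd -d /dev/mapper/sbd_disk1 -1 10 -4 20 create
  sbd -d /dev/mapper/sbd_disk2 -1 10 -4 20 create

  # verify what actually ended up in the header
  sbd -d /dev/mapper/sbd_disk1 dump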
Here is what I would like to know:
a) Is SBD working as intended when it kills the nodes even though 2 paths are still available?
b) If not, is the attached multipath.conf correct? The storage controller we use is an IBM SVC (IBM 2145); should there be any SVC-specific configuration for it (as in multipath.conf.defaults)? A snippet of what I believe the defaults are follows after the questions.
c) How should I go about increasing the timeouts in SBD correctly?
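Regarding b), this is roughly what I believe the built-in defaults for the 2145 look like (quoted from memory, so it may well differ from the multipath.conf.defaults shipped with our multipath-tools version):

  device {
          vendor                "IBM"
          product               "2145"
          path_grouping_policy  group_by_prio
          prio                  alua
          path_checker          tur
          path_selector         "round-robin 0"
          failback              immediate
          no_path_retry         5
          # unsure: should no_path_retry be "fail" (or a very low value)
          # for the SBD disks, so that I/O errors surface quickly?
  }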
Attachments:
multipath.conf and SBD dump (http://hpaste.org/69537)