Heartbeat/DRBD failover didn't work as expected. How do I make the failover more robust?
- by Quinn Murphy
I had a scenario where a DRBD/heartbeat setup had a failed node but did not fail over. The primary node had locked up but did not go down completely: it was inaccessible via ssh and through the NFS mount, but it could still be pinged. The desired behavior would have been to detect this and fail over to the secondary node. It appears that because the primary never went fully down (there is a dedicated network connection from server to server), heartbeat's detection mechanism did not pick up on the failure and therefore did not fail over.
Has anyone seen this? Is there something I need to configure to get more robust cluster failover? DRBD otherwise seems to work fine (I had to resync when I rebooted the old primary), but without reliable failover, its use is limited.
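To make the failure mode concrete, here is a minimal Python sketch of the kind of service-level check I have in mind: instead of relying on ping, it connects to the SSH port and waits for the `SSH-` identification banner, which only arrives if the daemon in userland is still answering. The function names `read_banner` and `ssh_responds` are hypothetical helpers of mine, port 22 is an assumption about this setup, and none of this is part of heartbeat itself.

```python
import socket


def read_banner(host, port, timeout=3.0, max_bytes=64):
    """Connect to host:port and return the first bytes the service sends.

    Returns b"" if the connection fails, times out, or the peer sends nothing.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            return conn.recv(max_bytes)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return b""


def ssh_responds(host, port=22, timeout=3.0):
    """True only if the peer sends an SSH identification string in time.

    A kernel that still completes the TCP handshake for a hung daemon
    will not produce the banner, so this returns False in that case.
    """
    return read_banner(host, port, timeout).startswith(b"SSH-")


if __name__ == "__main__":
    # Hostname taken from the cluster configuration in this post.
    print("nfs03 ssh alive:", ssh_responds("nfs03.openair.com"))
```

A check like this would have to be run from the secondary (or a third machine) and wired to whatever triggers the takeover, which is the part I do not know how to do properly with heartbeat alone.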
heartbeat 3.0.4
drbd84
RHEL 6.1
We are not using Pacemaker
nfs03 is the primary server in this setup, and nfs01 is the secondary.
ha.cf
# Heartbeat Logging
logfacility daemon
udpport 694
ucast eth0 192.168.10.47
ucast eth0 192.168.10.42
# Cluster members
node nfs01.openair.com
node nfs03.openair.com
# Heartbeat communication timing.
# Sets the triggers and pulse time for swapping over.
keepalive 1
warntime 10
deadtime 30
initdead 120
#fail back automatically
auto_failback on
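As a sanity check on the timing values above, here is a small Python sketch that parses the four timing directives and tests them against two rules of thumb: keepalive < warntime < deadtime, and initdead at least twice deadtime. The helper names `parse_timings` and `timing_problems` are hypothetical, and the rules are my reading of the usual ha.cf guidance rather than anything heartbeat enforces. As far as I understand, these timers only govern missed heartbeat packets, so they would not by themselves catch a node that hangs while still sending heartbeats.

```python
HA_CF = """\
keepalive 1
warntime 10
deadtime 30
initdead 120
"""

TIMING_NAMES = {"keepalive", "warntime", "deadtime", "initdead"}


def parse_timings(ha_cf_text):
    """Return the timing directives found in ha.cf text, in seconds."""
    timings = {}
    for raw in ha_cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        parts = line.split()
        if len(parts) == 2 and parts[0] in TIMING_NAMES:
            value = parts[1]
            if value.endswith("ms"):
                # Values such as "500ms" are given in milliseconds.
                timings[parts[0]] = float(value[:-2]) / 1000.0
            else:
                timings[parts[0]] = float(value)
    return timings


def timing_problems(t):
    """Return a list of rule-of-thumb violations (empty if none)."""
    problems = []
    if not t["keepalive"] < t["warntime"] < t["deadtime"]:
        problems.append("expected keepalive < warntime < deadtime")
    if t["initdead"] < 2 * t["deadtime"]:
        problems.append("initdead should be at least twice deadtime")
    return problems


if __name__ == "__main__":
    print(timing_problems(parse_timings(HA_CF)))
```

With the values from this ha.cf the list of problems comes back empty, so the timers themselves do not look like the cause.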
And here is the haresources file:
nfs03.openair.com IPaddr::192.168.10.50/255.255.255.0/eth0 drbddisk::data Filesystem::/dev/drbd0::/data::ext4 nfs nfslock