Search Results

Search found 24 results on 1 pages for 'pacemaker'.

Page 1/1 | 1 

  • ldirectord ipvsadm not show reals ip and not work wtih pacemaker and corosync

    - by miguer27
    first thanks for your time. I'm having a problem with ldirectord that I can not solve, I comment my situation: I have two nodes with pace maker and corosync and configure somes resources: root@ldap1:/home/mamartin# crm status Last updated: Tue Jun 3 12:58:30 2014 Last change: Tue Jun 3 12:23:47 2014 via cibadmin on ldap1 Stack: openais Current DC: ldap2 - partition with quorum Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff 2 Nodes configured, 2 expected votes 7 Resources configured. Online: [ ldap1 ldap2 ] Resource Group: IPV_LVS IPV_4 (ocf::heartbeat:IPaddr2): Started ldap1 IPV_6 (ocf::heartbeat:IPv6addr): Started ldap1 lvs (ocf::heartbeat:ldirectord): Started ldap1 Clone Set: clon_IPV_lo [IPV_lo] Started: [ ldap2 ] Stopped: [ IPV_lo:1 ] root@ldap1:/home/mamartin# crm configure show node ldap2 \ attributes standby="off" node ldap1 \ attributes standby="off" primitive IPV-lo_4 ocf:heartbeat:IPaddr \ params ip="192.168.1.10" cidr_netmask="32" nic="lo" \ op monitor interval="5s" primitive IPV-lo_6 ocf:heartbeat:IPv6addrLO \ params ipv6addr="[fc00:1::3]" cidr_netmask="64" \ op monitor interval="5s" primitive IPV_4 ocf:heartbeat:IPaddr2 \ params ip="192.168.1.10" nic="eth0" cidr_netmask="25" lvs_support="true" \ op monitor interval="5s" primitive IPV_6 ocf:heartbeat:IPv6addr \ params ipv6addr="[fc00:1::3]" nic="eth0" cidr_netmask="64" \ op monitor interval="5s" primitive lvs ocf:heartbeat:ldirectord \ params configfile="/etc/ldirectord.cf" \ op monitor interval="20" timeout="10" \ meta target-role="Started" group IPV_LVS IPV_4 IPV_6 lvs group IPV_lo IPV-lo_6 IPV-lo_4 clone clon_IPV_lo IPV_lo \ meta interleave="true" target-role="Started" location cli-prefer-IPV_LVS IPV_LVS \ rule $id="cli-prefer-rule-IPV_LVS" inf: #uname eq ldap1 colocation LVS_no_IPV_lo -inf: clon_IPV_lo IPV_LVS property $id="cib-bootstrap-options" \ dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ no-quorum-policy="ignore" \ stonith-enabled="false" \ last-lrm-refresh="1401264327" rsc_defaults $id="rsc-options" \ resource-stickiness="1000" The problem is in the ipvsadm only show a one real IP, when i configured two now, show the ldirector.cf: root@ldap1:/home/mamartin# ipvsadm IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags - RemoteAddress:Port Forward Weight ActiveConn InActConn TCP ldap-maqueta.cica.es:ldap wrr - ldap2.cica.es:ldap Route 4 0 0 TCP [[fc00:1::3]]:ldap wrr - [[fc00:1::2]]:ldap Route 4 0 0 root@ldap1:/home/mamartin# cat /etc/ldirectord.cf checktimeout=10 checkinterval=2 autoreload=yes logfile="/var/log/ldirectord.log" quiescent=yes #ipv4 virtual=192.168.1.10:389 real=192.168.1.11:389 gate 4 real=192.168.1.12:389 gate 4 scheduler=wrr protocol=tcp checktype=on #ipv6 virtual6=[[fc00:1::3]]:389 real6=[[fc00:1::1]]:389 gate 4 real6=[[fc00:1::2]]:389 gate 4 scheduler=wrr protocol=tcp checkport=389 checktype=on and in the logs I see nothing clear: root@ldap1:/home/mamartin# ldirectord -d /etc/ldirectord.cf start DEBUG2: Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.11:389 -g -w 0) Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.11:389 -g -w 0) DEBUG2: Quiescent real server: 192.168.1.11:389 (192.168.1.10:389) (Weight set to 0) Quiescent real server: 192.168.1.11:389 (192.168.1.10:389) (Weight set to 0) DEBUG2: Disabled real server=on:tcp:192.168.1.11:389:::4:gate:\/: (virtual=tcp:192.168.1.10:389) DEBUG2: Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 0) Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 0) DEBUG2: Quiescent real server: 192.168.1.12:389 (192.168.1.10:389) (Weight set to 0) Quiescent real server: 192.168.1.12:389 (192.168.1.10:389) (Weight set to 0) DEBUG2: Disabled real server=on:tcp:192.168.1.12:389:::4:gate:\/: (virtual=tcp:192.168.1.10:389) DEBUG2: Checking on: Real servers are added without any checks DEBUG2: Resetting soft failure count: 192.168.1.12:389 (tcp:192.168.1.10:389) Resetting soft failure count: 192.168.1.12:389 (tcp:192.168.1.10:389) DEBUG2: Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 4) Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 4) Destination already exists root@ldap1:/home/mamartin# cat /var/log/ldirectord.log [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Quiescent real server: 192.168.1.11:389 (192.168.1.10:389) (Weight set to 0) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Quiescent real server: 192.168.1.12:389 (192.168.1.10:389) (Weight set to 0) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Resetting soft failure count: 192.168.1.12:389 (tcp:192.168.1.10:389) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 4) failed: [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Added real server: 192.168.1.12:389 (192.168.1.10:389) (Weight set to 4) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Resetting soft failure count: 192.168.1.11:389 (tcp:192.168.1.10:389) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Restored real server: 192.168.1.11:389 (192.168.1.10:389) (Weight set to 4) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Resetting soft failure count: [[fc00:1::2]]:389 (tcp:[[fc00:1::3]]:389) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] system(/sbin/ipvsadm -a -t [[fc00:1::3]]:389 -r [[fc00:1::2]]:389 -g -w 4) failed: [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Added real server: [[fc00:1::2]]:389 ([[fc00:1::3]]:389) (Weight set to 4) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Resetting soft failure count: [[fc00:1::1]]:389 (tcp:[[fc00:1::3]]:389) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Restored real server: [[fc00:1::1]]:389 ([[fc00:1::3]]:389) (Weight set to 4) do not know if this is a bug or a configuration error, can anyone help? Regards.

    Read the article

  • does heartbeat v3 support same resource agent types of pacemaker?

    - by Emre He
    As we know, Pacemaker supports three types of Resource Agents, ? LSB Resource Agents, ? OCF Resource Agents, ? legacy Heartbeat Resource Agents http://www.linux-ha.org/wiki/Resource_Agents does heartbeat v3 support above 3 types resource agent? or it only support LSB and legacy heartbeat resource agents? because we have only virtual ip and one service need to switch in ha cluster, so we decide not involve pacemaker, so we come to this question, for example we cannot monitor the application service by heartbeat, heartbeat only can handle to start it on active node. thanks, Emre

    Read the article

  • Linux HA cluster w/Xen, Heartbeat, Pacemaker. domU does not failover to secondary node

    - by Kendall
    I am having the followig problem with an OenSuSE + Heartbeat + Pacemaker + Xen HA cluster: when the node a Xen domU is running on is "dead" the Xen domU running on it is not restarted on the second node. The cluster is setup with two nodes, each running OpenSuSE-11.3, Heartbeat 3.0, and Pacemaker 1.0 in CRM mode. For storage I am using a LUN on an iSCSI SAN device; the LUN is formatted with OCFS2 and managed with LVM. The Xen domU has two logical volumes; one for root and the other for swap. I am using IPMI cards for STONITH devices, and a dedicated ethernet link for heartbeat communications. The ha.cf file is as follows: keepalive 1 deadtime 10 warntime 5 udpport 694 ucast eth1 auto_failback off node dhcp-166 node stage use_logd yes crm yes My resources look as follows: shocrm(live)configure# show node $id="5c1aa924-bba4-4f95-a367-6c9a58ac4a38" dhcp-166 node $id="cebc92eb-af24-4833-aaf0-672adf80b58e" stage primitive Xen-Util ocf:heartbeat:Xen \ meta target-role="Started" \ operations $id="Xen-Util-operations" \ op start interval="0" timeout="60" start-delay="0" \ op stop interval="0" timeout="120" \ params xmfile="/etc/xen/vm/xen-util" primitive my-stonith stonith:external/ipmi \ params hostname="dhcp-166" ipaddr="192.168.3.106" userid="ADMIN" passwd="xxx" \ op monitor interval="2m" timeout="60s" primitive my-stonith2 stonith:external/ipmi \ params hostname="stage" ipaddr="192.168.3.105" userid="ADMIN" passwd="xxx" \ op monitor interval="2m" timeout="60s" property $id="cib-bootstrap-options" \ dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \ cluster-infrastructure="Heartbeat" The Xen domU config file is as follows: name = "xen-util" bootloader = "/usr/lib/xen/boot/domUloader.py" #bootargs = "xvda1:/vmlinuz-xen,/initrd-xen" bootargs = "--entry=xvda1:/boot/vmlinuz-xen,/boot/initrd-xen" memory = 4096 disk = [ 'phy:vg_xen/xen-util-root,xvda1,w', 'phy:vg_xen/xen-util-swap,xvda2,w', ] root = "/dev/xvda1" vif = [ 'mac=00:16:3e:42:42:06' ] #vfb = [ 'type=vnc,vncunused=0,vnclisten=192.168.3.172' ] extra = "" Say domU "Xen-Util" is running on node "stage"; if "stage" goes down, "Xen-Util" does not restart on node "dhcp-166". It seems to want to try as an "xm list" will show it for a few seconds and if you "xm console xen-util" it will give a message like "copying /boot/kernel.gz from xvda1 to /var/lib/xen/tmp/kernel.a53gs for booting". However, it never gets past that, eventually gives up, and no longer appears in "xm list". Now, when node "stage" comes back online after being power cycled, it detects that "Xen-Util" isn't running, and starts it (on stage). I've tried starting "Xen-Util" on node "dhcp-166" without the cluster running, and it works fine. No problems. So, I know it works in that respect. Any ideas? Thanks!

    Read the article

  • MySQL: Pacemaker cannot start the failed master as a new slave?

    - by quanta
    I'm going to setup failover for MySQL replication (1 master and 1 slave) follow this guide: https://github.com/jayjanssen/Percona-Pacemaker-Resource-Agents/blob/master/doc/PRM-setup-guide.rst Here're the output of crm configure show: node serving-6192 \ attributes p_mysql_mysql_master_IP="192.168.6.192" node svr184R-638.localdomain \ attributes p_mysql_mysql_master_IP="192.168.6.38" primitive p_mysql ocf:percona:mysql \ params config="/etc/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysql.sock" replication_user="repl" replication_passwd="x" test_user="test_user" test_passwd="x" \ op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \ op monitor interval="2s" role="Slave" timeout="30s" OCF_CHECK_LEVEL="1" \ op start interval="0" timeout="120s" \ op stop interval="0" timeout="120s" primitive writer_vip ocf:heartbeat:IPaddr2 \ params ip="192.168.6.8" cidr_netmask="32" \ op monitor interval="10s" \ meta is-managed="true" ms ms_MySQL p_mysql \ meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true" colocation writer_vip_on_master inf: writer_vip ms_MySQL:Master order ms_MySQL_promote_before_vip inf: ms_MySQL:promote writer_vip:start property $id="cib-bootstrap-options" \ dc-version="1.0.12-unknown" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ no-quorum-policy="ignore" \ stonith-enabled="false" \ last-lrm-refresh="1341801689" property $id="mysql_replication" \ p_mysql_REPL_INFO="192.168.6.192|mysql-bin.000006|338" crm_mon: Last updated: Mon Jul 9 10:30:01 2012 Stack: openais Current DC: serving-6192 - partition with quorum Version: 1.0.12-unknown 2 Nodes configured, 2 expected votes 2 Resources configured. ============ Online: [ serving-6192 svr184R-638.localdomain ] Master/Slave Set: ms_MySQL Masters: [ serving-6192 ] Slaves: [ svr184R-638.localdomain ] writer_vip (ocf::heartbeat:IPaddr2): Started serving-6192 Editing /etc/my.cnf on the serving-6192 of wrong syntax to test failover and it's working fine: svr184R-638.localdomain being promoted to become the master writer_vip switch to svr184R-638.localdomain Current state: Last updated: Mon Jul 9 10:35:57 2012 Stack: openais Current DC: serving-6192 - partition with quorum Version: 1.0.12-unknown 2 Nodes configured, 2 expected votes 2 Resources configured. ============ Online: [ serving-6192 svr184R-638.localdomain ] Master/Slave Set: ms_MySQL Masters: [ svr184R-638.localdomain ] Stopped: [ p_mysql:0 ] writer_vip (ocf::heartbeat:IPaddr2): Started svr184R-638.localdomain Failed actions: p_mysql:0_monitor_5000 (node=serving-6192, call=15, rc=7, status=complete): not running p_mysql:0_demote_0 (node=serving-6192, call=22, rc=7, status=complete): not running p_mysql:0_start_0 (node=serving-6192, call=26, rc=-2, status=Timed Out): unknown exec error Remove the wrong syntax from /etc/my.cnf on serving-6192, and restart corosync, what I would like to see is serving-6192 was started as a new slave but it doesn't: Failed actions: p_mysql:0_start_0 (node=serving-6192, call=4, rc=1, status=complete): unknown error Here're snippet of the logs which I'm suspecting: Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: rsc:p_mysql:0:4: start Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: RA output: (p_mysql:0:start:stderr) Error performing operation: The object/attribute does not exist Jul 09 10:46:32 serving-6192 crm_attribute: [7420]: info: Invoked: /usr/sbin/crm_attribute -N serving-6192 -l reboot --name readable -v 0 The full logs: http://fpaste.org/AyOZ/ The strange thing is I can starting it manually: export OCF_ROOT=/usr/lib/ocf export OCF_RESKEY_config="/etc/my.cnf" export OCF_RESKEY_pid="/var/run/mysqld/mysqld.pid" export OCF_RESKEY_socket="/var/lib/mysql/mysql.sock" export OCF_RESKEY_replication_user="repl" export OCF_RESKEY_replication_passwd="x" export OCF_RESKEY_test_user="test_user" export OCF_RESKEY_test_passwd="x" sh -x /usr/lib/ocf/resource.d/percona/mysql start: http://fpaste.org/RVGh/ Did I make something wrong?

    Read the article

  • High Availability Configuration using Heartbeat and Pacemaker

    - by pradeepchhetri
    I have the following setup: I have configured high availability between two load balancers (HAProxy) so that if HAProxy1 get down, the floating IP gets transferred to the other load balancer HAProxy2, hence all the clients will get the response from HAProxy2, which at the back-end is doing LB among the sme two webserver. This is for removing the single point of failure in case of only one HAProxy. Whenever I stops the hearbeat in HAProxy1, the floating IP goes to HAProxy2. But I want to configure such that whenever the process haproxy goes down, the floating IP should get assigned to HAProxy2. Can someone tell me how to implement it ?

    Read the article

  • Redhat cluster Vs Pacemaker Vs Gluster Vs Sheepdog

    - by chandank
    Changing the entire question as earlier one was very confusing. I have been exploring different clustering system to run Virtual machines on two different machines on LAN with high availability. Currently I am already using DRBD resource on two different machines on Primary/Secondary mode. In case the primary fails I manually promote the secondary to Primary and start the VM. I also explored Gluster and looks good, however, I would rather prefer clustering over Gluster (user space FS). So if anyone has idea which one would be better from ease of use prospective please I would be interested in. Moreover, sheepdog project appears good, however, could not find much documentations/Howtos. I am using Centos 6.

    Read the article

  • Active node stops resources when pasive node is shutdown

    - by Wakaru44
    2 nodes, active/pasive. 2 resources, a virtual ip, openLdap, and the nfs mount where openldap saves the data. When both nodes are up, things worked fine. You could move resources away and put the active in stanby. But when i rebooted the passive node, ( with the resources in the active node), and the passive node loses conectivity, all the resources in the active where stopped by pacemaker. I'm reading the documentation right now, but I just need a little quick tip to figure what could be hapenning here. Im using: corosync pacemaker RHEL 6

    Read the article

  • NFS v4, HA Migration, and stale handles on clients

    - by Karl Katzke
    I'm managing a server running NFS v4 with Pacemaker/OpenAIS. NFS is configured to use TCP. When I migrate the NFS server to another node in the Pacemaker cluster, even though the metadata is persisted, connections from the clients 'hang' and eventually time out after 90 seconds. After that 90 seconds, the old mountpoint becomes 'stale' and the mounted files can no longer be accessed. The 90 second grace period seems to be part of the server configuration and not the client configuration. I see this message on the server: kernel: NFSD: starting 90-second grace period If I restart the NFS client on the client nodes after I migrate (unmounting and then remounting the share), then I don't experience the problem, but connections and file transfers still interrupted. Three questions: What is the 90 second grace period? What's it there for? How can I keep the files from going stale on the clients without restarting them after I migrate the NFS server to another node? Is it actually possible to migrate the NFS server without having large file uploads drop?

    Read the article

  • Corosync :: Restarting some resources after Lan connectivity issue

    - by moebius_eye
    I am currently looking into corosync to build a two-node cluster. So, I've got it working fine, and it does what I want to do, which is: Lost connectivity between the two nodes gives the first node '10node' both Failover Wan IPs. (aka resources WanCluster100 and WanCluster101 ) '11node' does nothing. He "thinks" he still has his Failover Wan IP. (aka WanCluster101) But it doesn't do this: '11node' should restart the WanCluster101 resource when the connectivity with the other node is back. This is to prevent a condition where node10 simply dies (and thus does not get 11node's Failover Wan IP), resulting in a situation where none of the nodes have 10node's failover IP because 10node is down 11node has "given back" his failover Wan IP. Here's the current configuration I'm working on. node 10sch \ attributes standby="off" node 11sch \ attributes standby="off" primitive LanCluster100 ocf:heartbeat:IPaddr2 \ params ip="172.25.0.100" cidr_netmask="32" nic="eth3" \ op monitor interval="10s" \ meta is-managed="true" target-role="Started" primitive LanCluster101 ocf:heartbeat:IPaddr2 \ params ip="172.25.0.101" cidr_netmask="32" nic="eth3" \ op monitor interval="10s" \ meta is-managed="true" target-role="Started" primitive Ping100 ocf:pacemaker:ping \ params host_list="192.0.2.1" multiplier="500" dampen="15s" \ op monitor interval="5s" \ meta target-role="Started" primitive Ping101 ocf:pacemaker:ping \ params host_list="192.0.2.1" multiplier="500" dampen="15s" \ op monitor interval="5s" \ meta target-role="Started" primitive WanCluster100 ocf:heartbeat:IPaddr2 \ params ip="192.0.2.100" cidr_netmask="32" nic="eth2" \ op monitor interval="10s" \ meta target-role="Started" primitive WanCluster101 ocf:heartbeat:IPaddr2 \ params ip="192.0.2.101" cidr_netmask="32" nic="eth2" \ op monitor interval="10s" \ meta target-role="Started" primitive Website0 ocf:heartbeat:apache \ params configfile="/etc/apache2/apache2.conf" options="-DSSL" \ operations $id="Website-one" \ op start interval="0" timeout="40" \ op stop interval="0" timeout="60" \ op monitor interval="10" timeout="120" start-delay="0" statusurl="http://127.0.0.1/server-status/" \ meta target-role="Started" primitive Website1 ocf:heartbeat:apache \ params configfile="/etc/apache2/apache2.conf.1" options="-DSSL" \ operations $id="Website-two" \ op start interval="0" timeout="40" \ op stop interval="0" timeout="60" \ op monitor interval="10" timeout="120" start-delay="0" statusurl="http://127.0.0.1/server-status/" \ meta target-role="Started" group All100 WanCluster100 LanCluster100 group All101 WanCluster101 LanCluster101 location AlwaysPing100WithNode10 Ping100 \ rule $id="AlWaysPing100WithNode10-rule" inf: #uname eq 10sch location AlwaysPing101WithNode11 Ping101 \ rule $id="AlWaysPing101WithNode11-rule" inf: #uname eq 11sch location NeverLan100WithNode11 LanCluster100 \ rule $id="RAND1083308" -inf: #uname eq 11sch location NeverPing100WithNode11 Ping100 \ rule $id="NeverPing100WithNode11-rule" -inf: #uname eq 11sch location NeverPing101WithNode10 Ping101 \ rule $id="NeverPing101WithNode10-rule" -inf: #uname eq 10sch location Website0NeedsConnectivity Website0 \ rule $id="Website0NeedsConnectivity-rule" -inf: not_defined pingd or pingd lte 0 location Website1NeedsConnectivity Website1 \ rule $id="Website1NeedsConnectivity-rule" -inf: not_defined pingd or pingd lte 0 colocation Never -inf: LanCluster101 LanCluster100 colocation Never2 -inf: WanCluster100 LanCluster101 colocation NeverBothWebsitesTogether -inf: Website0 Website1 property $id="cib-bootstrap-options" \ dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ no-quorum-policy="ignore" \ stonith-enabled="false" \ last-lrm-refresh="1408954702" \ maintenance-mode="false" rsc_defaults $id="rsc-options" \ resource-stickiness="100" \ migration-threshold="3" I also have a less important question concerning this line: colocation NeverBothLans -inf: LanCluster101 LanCluster100 How do I tell it that this collocation only applies to '11node'.

    Read the article

  • Linux stretch cluster: MD replication, DRBD or Veritas?

    - by PieterB
    For the moment there's a lot of choices for setting up a Linux cluster. For cluster manager: you can use Red Hat Cluster manager, Pacemaker or Veritas Cluster Server. The first one has the most momentum, the second one comes by default with RH subscriptions and the last one is very expensive and has a very good reputation ;-) For storage: - You can replicate LUN's using software raid / md device - You can use the network using DRBD replication, which offers a bit more flexibility - You can use Veritas Storage Foundation technology to talk to your SANs replication technology. Anyone has any recommandations or experience with these technologies?

    Read the article

  • Linux stretch cluster: MD replication, DRBD or Veritas?

    - by PieterB
    For the moment there's a lot of choices for setting up a Linux cluster. For cluster manager: you can use Red Hat Cluster manager, Pacemaker or Veritas Cluster Server. The first one has the most momentum, the second one comes by default with RH subscriptions and the last one is very expensive and has a very good reputation ;-) For storage: - You can replicate LUN's using software raid / md device - You can use the network using DRBD replication, which offers a bit more flexibility - You can use Veritas Storage Foundation technology to talk to your SANs replication technology. Anyone has any recommandations or experience with these technologies?

    Read the article

  • Corosync - stopping the service crashes the server

    - by Antipop
    I am trying to set up a test cluster on a Xen Server with 2 paravirtualized CentOS 5.4 machines. I am using Pacemaker+Corosync, and following the instructions found at http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf and other sites. Anyway, when I try to manually stop the corosync service, about 80% of the times the whole VM locks up with the message "Waiting for corosync services to unload" and I am forced to shut the machine down manually. For the remaining 20%, the VM keeps responding and adds dots to the above message, but it won't actually stop the service. There aren't many resources on the internet about this particular error. Any ideas about this? Thanks in advance.

    Read the article

  • IP failover with 2 nodes on different subnet: cannot ping virtual IP from second node?

    - by quanta
    I'm going to setup redundant failover Redmine: another instance was installed on the second server without problem MySQL (running on the same machine with Redmine) was configured as master-master replication Because they are in different subnet (192.168.3.x and 192.168.6.x), it seems that VIPArip is the only choice. /etc/ha.d/ha.cf on node1 logfacility none debug 1 debugfile /var/log/ha-debug logfile /var/log/ha-log autojoin none warntime 3 deadtime 6 initdead 60 udpport 694 ucast eth1 node2.ip keepalive 1 node node1 node node2 crm respawn /etc/ha.d/ha.cf on node2: logfacility none debug 1 debugfile /var/log/ha-debug logfile /var/log/ha-log autojoin none warntime 3 deadtime 6 initdead 60 udpport 694 ucast eth0 node1.ip keepalive 1 node node1 node node2 crm respawn crm configure show: node $id="6c27077e-d718-4c82-b307-7dccaa027a72" node1 node $id="740d0726-e91d-40ed-9dc0-2368214a1f56" node2 primitive VIPArip ocf:heartbeat:VIPArip \ params ip="192.168.6.8" nic="lo:0" \ op start interval="0" timeout="20s" \ op monitor interval="5s" timeout="20s" depth="0" \ op stop interval="0" timeout="20s" \ meta is-managed="true" property $id="cib-bootstrap-options" \ stonith-enabled="false" \ dc-version="1.0.12-unknown" \ cluster-infrastructure="Heartbeat" \ last-lrm-refresh="1338870303" crm_mon -1: ============ Last updated: Tue Jun 5 18:36:42 2012 Stack: Heartbeat Current DC: node2 (740d0726-e91d-40ed-9dc0-2368214a1f56) - partition with quorum Version: 1.0.12-unknown 2 Nodes configured, unknown expected votes 1 Resources configured. ============ Online: [ node1 node2 ] VIPArip (ocf::heartbeat:VIPArip): Started node1 ip addr show lo: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo inet 192.168.6.8/32 scope global lo inet6 ::1/128 scope host valid_lft forever preferred_lft forever I can ping 192.168.6.8 from node1 (192.168.3.x): # ping -c 4 192.168.6.8 PING 192.168.6.8 (192.168.6.8) 56(84) bytes of data. 64 bytes from 192.168.6.8: icmp_seq=1 ttl=64 time=0.062 ms 64 bytes from 192.168.6.8: icmp_seq=2 ttl=64 time=0.046 ms 64 bytes from 192.168.6.8: icmp_seq=3 ttl=64 time=0.059 ms 64 bytes from 192.168.6.8: icmp_seq=4 ttl=64 time=0.071 ms --- 192.168.6.8 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3000ms rtt min/avg/max/mdev = 0.046/0.059/0.071/0.011 ms but cannot ping virtual IP from node2 (192.168.6.x) and outside. Did I miss something? PS: you probably want to set IP2UTIL=/sbin/ip in the /usr/lib/ocf/resource.d/heartbeat/VIPArip resource agent script if you get something like this: Jun 5 11:08:10 node1 lrmd: [19832]: info: RA output: (VIPArip:stop:stderr) 2012/06/05_11:08:10 ERROR: Invalid OCF_RESK EY_ip [192.168.6.8] http://www.clusterlabs.org/wiki/Debugging_Resource_Failures Reply to @DukeLion: Which router receives RIP updates? When I start the VIPArip resource, ripd was run with below configuration file (on node1): /var/run/resource-agents/VIPArip-ripd.conf: hostname ripd password zebra debug rip events debug rip packet debug rip zebra log file /var/log/quagga/quagga.log router rip !nic_tag no passive-interface lo:0 network lo:0 distribute-list private out lo:0 distribute-list private in lo:0 !metric_tag redistribute connected metric 3 !ip_tag access-list private permit 192.168.6.8/32 access-list private deny any

    Read the article

  • Linux HA / cluster: what are the differences between Pacemaker, Heartbeat, Corosync, wackamole?

    - by Continuation
    Can you help me understand Linux HA? Pacemaker, Heartbeat, Corosync seem to be part of a whole HA stack, but how do they fit together? How does wackamole differ from Pacemaker/Heartbeat/Corosync? I've seen opinions that wackamole is better than Heartbeat because it's peer-based. Is that valid? The last release of wackamole was 2.5 years ago. Is it still being maintained or active? What would you recommend for HA setup for web/application/database servers?

    Read the article

  • Avoiding DNS timeouts when a dns server fails

    - by user65124
    Hi there. We have a small datacenter with about a hundred hosts pointing to 3 internal dns servers (bind 9). Our problem comes when one of the internal dns servers becomes unavailable. At that point all the clients that point to that server start performing very slowly. The problem seems to be that the stock linux resolver doesn't really have the concept of "failing over" to a different dns server. You can adjust the timeout and number of retries it uses, (and set rotate so it will work through the list), but no matter what settings one uses our services perform much more slowly if a primary dns server becomes unavailable. At the moment this is one of the largest sources of service disruptions for us. My ideal answer would be something like "RTFM: tweak /etc/resolv.conf like this...", but if that's an option I haven't seen it. I was wondering how other folks handled this issue? I can see 3 possible types of solutions: Use linux-ha/Pacemaker and failover ips (so the dns IP VIPs are "always" available). Alas, we don't have a good fencing infrastructure, and without fencing pacemaker doesn't work very well (in my experience Pacemaker lowers availability without fencing). Run a local dns server on each node, and have resolv.conf point to localhost. This would work, but it would give us a lot more services to monitor and manage. Run a local cache on each node. Folks seem to consider nscd "broken", but dnrd seems to have the right feature set: it marks dns servers as up or down, and won't use 'down' dns servers. Any-casting seems to work only at the ip routing level, and depends on route updates for server failure. Multi-casting seemed like it would be a perfect answer, but bind does not support broadcasting or multi-casting, and the docs I could find seem to suggest that multicast dns is more aimed at service discovery and auto-configuration rather than regular dns resolving. Am I missing an obvious solution?

    Read the article

  • Avoiding DNS timeouts when a dns server fails

    - by Neil Katin
    We have a small datacenter with about a hundred hosts pointing to 3 internal dns servers (bind 9). Our problem comes when one of the internal dns servers becomes unavailable. At that point all the clients that point to that server start performing very slowly. The problem seems to be that the stock linux resolver doesn't really have the concept of "failing over" to a different dns server. You can adjust the timeout and number of retries it uses, (and set rotate so it will work through the list), but no matter what settings one uses our services perform much more slowly if a primary dns server becomes unavailable. At the moment this is one of the largest sources of service disruptions for us. My ideal answer would be something like "RTFM: tweak /etc/resolv.conf like this...", but if that's an option I haven't seen it. I was wondering how other folks handled this issue? I can see 3 possible types of solutions: Use linux-ha/Pacemaker and failover ips (so the dns IP VIPs are "always" available). Alas, we don't have a good fencing infrastructure, and without fencing pacemaker doesn't work very well (in my experience Pacemaker lowers availability without fencing). Run a local dns server on each node, and have resolv.conf point to localhost. This would work, but it would give us a lot more services to monitor and manage. Run a local cache on each node. Folks seem to consider nscd "broken", but dnrd seems to have the right feature set: it marks dns servers as up or down, and won't use 'down' dns servers. Any-casting seems to work only at the ip routing level, and depends on route updates for server failure. Multi-casting seemed like it would be a perfect answer, but bind does not support broadcasting or multi-casting, and the docs I could find seem to suggest that multicast dns is more aimed at service discovery and auto-configuration rather than regular dns resolving. Am I missing an obvious solution?

    Read the article

  • What's the difference between keepalived and corosync, others?

    - by Matt
    I'm building a failover firewall for a server cluster and started looking at the various options. I'm more familiar with carp on freebsd, but need to use linux for this project. Searching google has produced several different projects, but no clear information about features they provide . CARP gave virtual interfaces that failover, I am not really clear on whether that's what corosync does, or is that what pacemaker does? On the other hand I did get manage to get keepalived working. However, I noted that corosync provides native support for infiniband. This would be useful for me. Perhaps someone could shed some light on the differences between: corosync keepalive pacemaker heartbeat Which product would be the best fit for router failover?

    Read the article

  • Handling database failover for Rails applications on FreeBSD

    - by bianster
    I'm working on implementing database (Postgresql) failover for a Rails app that runs with Passenger/FreeBSD. Due to certain constraints regarding the server OS, it's necessary to continue using FreeBSD (as opposed to say, Ubuntu). I'm finding it to be quite a challenge to have failover handled within the Rails application, by way of a customised database adapter due to the fact that this application will be load-balanced between several webservers, and the multiple Rails processes that Passenger spawned in each webserver. I previously looked at setting up Pacemaker/Corosync to manage database server failover on a common IP but unfortunately I wasn't able to get past building the packages on FreeBSD. It does work rather well on Ubuntu 10.04 but I'm not likely to be able to use Ubuntu due to the OS constraints. I'm considering a custom witness daemon that simply pings the primary DB server, and this witness daemon switches all the webservers to the standby DB server when the primary becomes uncontactable (permanently/temporarily), to avoid split-brain. Though I would really like to know if there is a way to get Pacemaker(or something similar) to do the switch on FreeBSD.

    Read the article

  • Network and Server Management Tools

    - by jessieE
    We are building a farm of test servers. Currently we have 8 servers. We are planning to use the servers to test the following Mysql Cluster Xen or KVM virtualization Heartbeat/Pacemaker/DRDB What tools do experienced sysads use for: Initial installation of operating system( installing centos 5 or ubuntu server manually 8 times seems like a tedious task that just begs for automation) Centralized Configuration Management and Software Updates for Host and possibly Guest(virtualized) servers Hardware, Services and Network Monitoring

    Read the article

  • Dedicated Servers: Is one better then two for LAMP pseudo HA setup? [closed]

    - by bikedorkseattle
    Possible Duplicate: How to find web hosting that meets my requirements? I know there are zillions of commentary about hosting out there, but I haven't read much about this. Our current well known host is having too many problems, the hardware we are on it subpar, and I'm ready to leave. A day of downtime can cost as much as our monthly hosting bill. A month of bad performance is just killing us right now, user and google wise. I'm wondering about running two dedicated boxes for LAMP, one running as the primary Nginx/Apache (proxy pass), and the other as the MySQL box. Running a single box scares the bejesus out of me because who knows how long it will take anyone to fix a raid card or whatever. The idea is to set this up using some sort of failover system using pacemaker and heartbeat. If one server goes down the other can take over for the other running both web and db. There are some good articles over at Linode about this. I have a few DBs that are 1GB+ and would like to load them into memory. Because of this, I'm shying away from a Linode HA setup because for the price I could do it with two dedicated like I described. Am I mad or an idiot? What are people out there doing for pseodu high availability good performance setups under $400/month? I'm a webmaster; I do a lot of things none of it that well :)

    Read the article

  • New Options for MySQL High Availability

    - by Mat Keep
    Data is the currency of today’s web, mobile, social, enterprise and cloud applications. Ensuring data is always available is a top priority for any organization – minutes of downtime will result in significant loss of revenue and reputation. There is not a “one size fits all” approach to delivering High Availability (HA). Unique application attributes, business requirements, operational capabilities and legacy infrastructure can all influence HA technology selection. And then technology is only one element in delivering HA – “People and Processes” are just as critical as the technology itself. For this reason, MySQL Enterprise Edition is available supporting a range of HA solutions, fully certified and supported by Oracle. MySQL Enterprise HA is not some expensive add-on, but included within the core Enterprise Edition offering, along with the management tools, consulting and 24x7 support needed to deliver true HA. At the recent MySQL Connect conference, we announced new HA options for MySQL users running on both Linux and Solaris: - DRBD for MySQL - Oracle Solaris Clustering for MySQL DRBD (Distributed Replicated Block Device) is an open source Linux kernel module which leverages synchronous replication to deliver high availability database applications across local storage. DRBD synchronizes database changes by mirroring data from an active node to a standby node and supports automatic failover and recovery. Linux, DRBD, Corosync and Pacemaker, provide an integrated stack of mature and proven open source technologies. DRBD Stack: Providing Synchronous Replication for the MySQL Database with InnoDB Download the DRBD for MySQL whitepaper to learn more, including step-by-step instructions to install, configure and provision DRBD with MySQL Oracle Solaris Cluster provides high availability and load balancing to mission-critical applications and services in physical or virtualized environments. With Oracle Solaris Cluster, organizations have a scalable and flexible solution that is suited equally to small clusters in local datacenters or larger multi-site, multi-cluster deployments that are part of enterprise disaster recovery implementations. The Oracle Solaris Cluster MySQL agent integrates seamlessly with MySQL offering a selection of configuration options in the various Oracle Solaris Cluster topologies. Putting it All Together When you add MySQL Replication and MySQL Cluster into the HA mix, along with 3rd party solutions, users have extensive choice (and decisions to make) to deliver HA services built on MySQL To make the decision process simpler, we have also published a new MySQL HA Solutions Guide. Exploring beyond just the technology, the guide presents a methodology to select the best HA solution for your new web, cloud and mobile services, while also discussing the importance of people and process in ensuring service continuity. This is subject recently presented at Oracle Open World, and the slides are available here. Whatever your uptime requirements, you can be sure MySQL has an HA solution for your needs Please don't hesitate to let us know of your HA requirements in the comments section of this blog. You can also contact MySQL consulting to learn more about their HA Jumpstart offering which will help you scope out your scaling and HA requirements.

    Read the article

  • Heartbeat/DRBD failover didn't work as expected. How do I make the failover more robust?

    - by Quinn Murphy
    I had a scenario where a DRBD-heartbeat set up had a failed node but did not failover. What happened was the primary node had locked up, but didn't go down directly (it was inaccessible via ssh or with the nfs mount, but it could be pinged). The desired behavior would have been to detect this and failover to the secondary node, but it appears that since the primary didn't go full down (there is a dedicated network connection from server to server), heartbeat's detection mechanism didn't pick up on that and therefore didn't failover. Has anyone seen this? Is there something that I need to configure to have more robust cluster failover? DRBD seems to otherwise work fine (had to resync when I rebooted the old primary), but without good failover, it's use is limited. heartbeat 3.0.4 drbd84 RHEL 6.1 We are not using Pacemaker nfs03 is the primary server in this setup, and nfs01 is the secondary. ha.cf # Hearbeat Logging logfacility daemon udpport 694 ucast eth0 192.168.10.47 ucast eth0 192.168.10.42 # Cluster members node nfs01.openair.com node nfs03.openair.com # Hearbeat communication timing. # Sets the triggers and pulse time for swapping over. keepalive 1 warntime 10 deadtime 30 initdead 120 #fail back automatically auto_failback on and here is the haresources file: nfs03.openair.com IPaddr::192.168.10.50/255.255.255.0/eth0 drbddisk::data Filesystem::/dev/drbd0::/data::ext4 nfs nfslock

    Read the article

  • RHCS: GFS2 in A/A cluster with common storage. Configuring GFS with rgmanager

    - by Pavel A
    I'm configuring a two node A/A cluster with a common storage attached via iSCSI, which uses GFS2 on top of clustered LVM. So far I have prepared a simple configuration, but am not sure which is the right way to configure gfs resource. Here is the rm section of /etc/cluster/cluster.conf: <rm> <failoverdomains> <failoverdomain name="node1" nofailback="0" ordered="0" restricted="1"> <failoverdomainnode name="rhc-n1"/> </failoverdomain> <failoverdomain name="node2" nofailback="0" ordered="0" restricted="1"> <failoverdomainnode name="rhc-n2"/> </failoverdomain> </failoverdomains> <resources> <script file="/etc/init.d/clvm" name="clvmd"/> <clusterfs name="gfs" fstype="gfs2" mountpoint="/mnt/gfs" device="/dev/vg-cs/lv-gfs"/> </resources> <service name="shared-storage-inst1" autostart="0" domain="node1" exclusive="0" recovery="restart"> <script ref="clvmd"> <clusterfs ref="gfs"/> </script> </service> <service name="shared-storage-inst2" autostart="0" domain="node2" exclusive="0" recovery="restart"> <script ref="clvmd"> <clusterfs ref="gfs"/> </script> </service> </rm> This is what I mean: when using clusterfs resource agent to handle GFS partition, it is not unmounted by default (unless force_unmount option is given). This way when I issue clusvcadm -s shared-storage-inst1 clvm is stopped, but GFS is not unmounted, so a node cannot alter LVM structure on shared storage anymore, but can still access data. And even though a node can do it quite safely (dlm is still running), this seems to be rather inappropriate to me, since clustat reports that the service on a particular node is stopped. Moreover if I later try to stop cman on that node, it will find a dlm locking, produced by GFS, and fail to stop. I could have simply added force_unmount="1", but I would like to know what is the reason behind the default behavior. Why is it not unmounted? Most of the examples out there silently use force_unmount="0", some don't, but none of them give any clue on how the decision was made. Apart from that I have found sample configurations, where people manage GFS partitions with gfs2 init script - https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Defining_The_Resources or even as simply as just enabling services such as clvm and gfs2 to start automatically at boot (http://pbraun.nethence.com/doc/filesystems/gfs2.html), like: chkconfig gfs2 on If I understand the latest approach correctly, such cluster only controls whether nodes are still alive and can fence errant ones, but such cluster has no control over the status of its resources. I have some experience with Pacemaker and I'm used to that all resources are controlled by a cluster and an action can be taken when not only there are connectivity issues, but any of the resources misbehave. So, which is the right way for me to go: leave GFS partition mounted (any reasons to do so?) set force_unmount="1". Won't this break anything? Why this is not the default? use script resource <script file="/etc/init.d/gfs2" name="gfs"/> to manage GFS partition. start it at boot and don't include in cluster.conf (any reasons to do so?) This may be a sort of question that cannot be answered unambiguously, so it would be also of much value for me if you shared your experience or expressed your thoughts on the issue. How does for example /etc/cluster/cluster.conf look like when configuring gfs with Conga or ccs (they are not available to me since for now I have to use Ubuntu for the cluster)? Thanks you very much!

    Read the article

1