Clusterware 11gR2 – Setting up an Active/Passive failover configuration
- by Gilles Haro
Oracle is providing a large range of interesting solutions to ensure High Availability of the database. Dataguard, RAC or even both configurations (as recommended by Oracle for a Maximum Available Architecture - MAA) are the most frequently found and used solutions. But, when it comes to protecting a system with an Active/Passive architecture with failover capabilities, people often thinks to other expensive third party cluster systems. Oracle Clusterware technology, which comes along at no extra-cost with Oracle Database or Oracle Unbreakable Linux, is - in the knowing of most people - often linked to Oracle RAC and therefore, is seldom used to implement failover solutions. Oracle Clusterware 11gR2 (a part of Oracle 11gR2 Grid Infrastructure) provides a comprehensive framework to setup automatic failover configurations. It is actually possible to make "failover-able'", and then to protect, almost any kind of application (from the simple xclock to the most complex Application Server). Quoting Oracle: “Oracle Clusterware is a portable cluster software that allows clustering of single servers so that they cooperate as a single system. Oracle Clusterware also provides the required infrastructure for Oracle Real Application Clusters (RAC). In addition Oracle Clusterware enables the protection of any Oracle application or any other kind of application within a cluster.” In the next couple of lines, I will try to present the different steps to achieve this goal : Have a fully operational 11gR2 database protected by automatic failover capabilities. I assume you are fluent in installing Oracle Database 11gR2, Oracle Grid Infrastructure 11gR2 on a Linux system and that ASM is not a problem for you (as I am using it as a shared storage). If not, please have a look at Oracle Documentation. As often, I made my tests using an Oracle VirtualBox environment. The scripts are tested and functional on my system. Unfortunately, there can always be a typo or a mistake. This blog entry does not replace a course around the Clusterware Framework. I just hope it will let you see how powerful it is and that it will give you the whilst to go further with it... Note : This entry has been revised (rev.2) following comments from Philip Newlan. Prerequisite 2 Linux boxes (OELCluster01 and OELCluster02) at the same OS level. I used OEL 5 Update 5 with an Enterprise Kernel. Shared Storage (SAN). On my VirtualBox system, I used Openfiler to simulate the SAN Oracle 11gR2 Database (11.2.0.1) Oracle 11gR2 Grid Infrastructure (11.2.0.1) Step 1 - Install the software Using asmlib, create 3 ASM disks (ASM_CRS, ASM_DTA and ASM_FRA) Install Grid Infrastructure for a cluster (OELCluster01 and OELCluster02 are the 2 nodes of the cluster) Use ASM_CRS to store Voting Disk and OCR. Use SCAN. Install Oracle Database Standalone binaries on both nodes. Use asmca to check/mount the disk groups on 2 nodes Use dbca to create and configure a database on the primary node Let's name it DB11G. Copy the pfile, password file to the second node. Create adump directoty on the second node. Step 2 - Setup the resource to be protected After its creation with dbca, the database is automatically protected by the Oracle Restart technology available with Grid Infrastructure. Consequently, it restarts automatically (if possible) after a crash (ex: kill -9 smon). A database resource has been created for that in the Cluster Registry. We can observe this with the command : crsctl status resource that shows and ora.dba11g.db entry. Let's save the definition of this resource, for future use : mkdir -p /crs/11.2.0/HA_scripts chown oracle:oinstall /crs/11.2.0/HA_scripts crsctl status resource ora.db11g.db -p > /crs/11.2.0/HA_scripts/myResource.txt Although very interesting, Oracle Restart is not cluster aware and cannot restart the database on any other node of the cluster. So, let's remove it from the OCR definitions, we don't need it ! srvctl stop database -d DB11G srvctl remove database -d DB11G Instead of it, we need to create a new resource of a more general type : cluster_resource. Here are the steps to achieve this : Create an action script : /crs/11.2.0/HA_scripts/my_ActivePassive_Cluster.sh #!/bin/bash export ORACLE_HOME=/oracle/product/11.2.0/dbhome_1 export ORACLE_SID=DB11G case $1 in 'start') $ORACLE_HOME/bin/sqlplus /nolog <<EOF connect / as sysdba startup EOF RET=0 ;; 'stop') $ORACLE_HOME/bin/sqlplus /nolog <<EOF connect / as sysdba shutdown immediate EOF RET=0 ;; 'clean') $ORACLE_HOME/bin/sqlplus /nolog <<EOF connect / as sysdba shutdown abort ##for i in `ps -ef | grep -i $ORACLE_SID | awk '{print $2}' ` ;do kill -9 $i; done EOF RET=0 ;; 'check') ok=`ps -ef | grep smon | grep $ORACLE_SID | wc -l` if [ $ok = 0 ]; then RET=1 else RET=0 fi ;; '*') RET=0 ;; esac if [ $RET -eq 0 ]; then exit 0 else exit 1 fi This script must provide, at least, methods to start, stop, clean and check the database. It is self-explaining and contains nothing special. Just be aware that it must be runnable (+x), it runs as Oracle user (because of the ACL property - see later) and needs to know about the environment. Also make sure it exists on every node of the cluster. Moreover, as of 11.2, the clean method is mandatory. It must provide the “last gasp clean up”, for example, a shutdown abort or a kill –9 of all the remaining processes. chmod +x /crs/11.2.0/HA_scripts/my_ActivePassive_Cluster.sh scp /crs/11.2.0/HA_scripts/my_ActivePassive_Cluster.sh oracle@OELCluster02:/crs/11.2.0/HA_scripts Create a new resource file, based on the information we got from previous myResource.txt . Name it myNewResource.txt. myResource.txt is shown below. As we can see, it defines an ora.database.type resource, named ora.db11g.db. A lot of properties are related to this type of resource and do not need to be used for a cluster_resource. NAME=ora.db11g.db TYPE=ora.database.type ACL=owner:oracle:rwx,pgrp:oinstall:rwx,other::r-- ACTION_FAILURE_TEMPLATE= ACTION_SCRIPT= ACTIVE_PLACEMENT=1 AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX% AUTO_START=restore CARDINALITY=1 CHECK_INTERVAL=1 CHECK_TIMEOUT=600 CLUSTER_DATABASE=false DB_UNIQUE_NAME=DB11G DEFAULT_TEMPLATE=PROPERTY(RESOURCE_CLASS=database) PROPERTY(DB_UNIQUE_NAME= CONCAT(PARSE(%NAME%, ., 2), %USR_ORA_DOMAIN%, .)) ELEMENT(INSTANCE_NAME= %GEN_USR_ORA_INST_NAME%) DEGREE=1 DESCRIPTION=Oracle Database resource ENABLED=1 FAILOVER_DELAY=0 FAILURE_INTERVAL=60 FAILURE_THRESHOLD=1 GEN_AUDIT_FILE_DEST=/oracle/admin/DB11G/adump GEN_USR_ORA_INST_NAME= GEN_USR_ORA_INST_NAME@SERVERNAME(oelcluster01)=DB11G HOSTING_MEMBERS= INSTANCE_FAILOVER=0 LOAD=1 LOGGING_LEVEL=1 MANAGEMENT_POLICY=AUTOMATIC NLS_LANG= NOT_RESTARTING_TEMPLATE= OFFLINE_CHECK_INTERVAL=0 ORACLE_HOME=/oracle/product/11.2.0/dbhome_1 PLACEMENT=restricted PROFILE_CHANGE_TEMPLATE= RESTART_ATTEMPTS=2 ROLE=PRIMARY SCRIPT_TIMEOUT=60 SERVER_POOLS=ora.DB11G SPFILE=+DTA/DB11G/spfileDB11G.ora START_DEPENDENCIES=hard(ora.DTA.dg,ora.FRA.dg) weak(type:ora.listener.type,uniform:ora.ons,uniform:ora.eons) pullup(ora.DTA.dg,ora.FRA.dg) START_TIMEOUT=600 STATE_CHANGE_TEMPLATE= STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DTA.dg,shutdown:ora.FRA.dg) STOP_TIMEOUT=600 UPTIME_THRESHOLD=1h USR_ORA_DB_NAME=DB11G USR_ORA_DOMAIN=haroland USR_ORA_ENV= USR_ORA_FLAGS= USR_ORA_INST_NAME=DB11G USR_ORA_OPEN_MODE=open USR_ORA_OPI=false USR_ORA_STOP_MODE=immediate VERSION=11.2.0.1.0 I removed database type related entries from myResource.txt and modified some other to produce the following myNewResource.txt. Notice the NAME property that should not have the ora. prefix Notice the TYPE property that is not ora.database.type but cluster_resource. Notice the definition of ACTION_SCRIPT. Notice the HOSTING_MEMBERS that enumerates the members of the cluster (as returned by the olsnodes command). NAME=DB11G.db TYPE=cluster_resource DESCRIPTION=Oracle Database resource ACL=owner:oracle:rwx,pgrp:oinstall:rwx,other::r-- ACTION_SCRIPT=/crs/11.2.0/HA_scripts/my_ActivePassive_Cluster.sh PLACEMENT=restricted ACTIVE_PLACEMENT=0 AUTO_START=restore CARDINALITY=1 CHECK_INTERVAL=10 DEGREE=1 ENABLED=1 HOSTING_MEMBERS=oelcluster01 oelcluster02 LOGGING_LEVEL=1 RESTART_ATTEMPTS=1 START_DEPENDENCIES=hard(ora.DTA.dg,ora.FRA.dg) weak(type:ora.listener.type,uniform:ora.ons,uniform:ora.eons) pullup(ora.DTA.dg,ora.FRA.dg) START_TIMEOUT=600 STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DTA.dg,shutdown:ora.FRA.dg) STOP_TIMEOUT=600 UPTIME_THRESHOLD=1h Register the resource. Take care of the resource type. It needs to be a cluster_resource and not a ora.database.type resource (Oracle recommendation) . crsctl add resource DB11G.db -type cluster_resource -file /crs/11.2.0/HA_scripts/myNewResource.txt Step 3 - Start the resource crsctl start resource DB11G.db This command launches the ACTION_SCRIPT with a start and a check parameter on the primary node of the cluster. Step 4 - Test this We will test the setup using 2 methods. crsctl relocate resource DB11G.db This command calls the ACTION_SCRIPT (on the two nodes) to stop the database on the active node and start it on the other node. Once done, we can revert back to the original node, but, this time we can use a more "MS$ like" method :Turn off the server on which the database is running. After short delay, you should observe that the database is relocated on node 1. Conclusion Once the software installed and the standalone database created (which is a rather common and usual task), the steps to reach the objective are quite easy : Create an executable action script on every node of the cluster. Create a resource file. Create/Register the resource with OCR using the resource file. Start the resource. This solution is a very interesting alternative to licensable third party solutions. References Clusterware 11gR2 documentation Oracle Clusterware Resource Reference Clusterware for Unbreakable Linux Using Oracle Clusterware to Protect A Single Instance Oracle Database 11gR1 (to have an idea of complexity) Oracle Clusterware on OTN Gilles Haro Technical Expert - Core Technology, Oracle Consulting