Recreating OMS instances in a HA environment when instances on all nodes are lost
- by rnigam
Oracle highly recommends deploying EM in a HA environment. The best practices for HA deployments, backup and housekeeping of your Enterprise Manager environment are documented in the Oracle Enterprise Manager Advanced Configuration Guide. It is imperative that there is a good disaster recovery plan in place for your EM deployment. In this post I want to talk about a customer who failed to do the correct planning and housekeeping for EM and landed in a situation where we the all the OMSes were nearly blown away had we not jumped to help.
We recently hit an issue at a customer site where we had a two node OMS setup of the Enterprise Manager and a RAC Database being used as the EM repository. An accidental delete of the OMS oracle home left us with a single node deployment. While we were trying to figure out a possible path to recover the first node, the second node was rebooted under a maintenance window. What followed was a complete site outage as the Admin and managed servers would not start on either of the nodes.
In my situation there were
- No backups of the Oracle Homes from any node
- No OMS Configuration snapshots (created using the “emctl exportconfig oms” command) and the instance home was completely lost on node 1 which also had the Admin Server
We did however have:
- A copy of the emkey.ora that I found under the OMS_ORACLE_HOME/ of the second node (NOTE: it is a bad practice to have your emkey present under the OMS Oracle home directory on the same server as the OMS. The backup of the emkey should be maintained on some other server. In this case however it was a savior in my situation since there were no backups
- The oms oracle home on the second node but missing a number of files and had a number of changes done to the files in the home. There were a number of attempts to start the server by modifying various files based on the Weblogic server logs to have atleast node up and running but all of them failed.
Here is how you can recover from this scenario:
Follow these steps:
STEP 1: Check status of emkey.ora
Check whether the emkey exists is present in the EM repository or not. Run the following command:
$OMS_ORACLE_HOME/bin/emctl status emkey
If the output is something like this below then you are good to go and the key is present in the repository
./emctl status emkey
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
Enter Enterprise Manager Root (SYSMAN) Password :
The EMKey is configured properly.
Here are the messages that you might see as the emctl status emkey output depending upon whether the EM Admin Server is up and if the key is configured properly:
Case1: AdminServer is up, emkey is proper in CredStore & not in repos. This is same as the output of the command shown above:The EMKey is configured properly
Case 2: AdminServer is up, emkey is proper in CredStore & exists in repos:The EMKey is configured properly, but is not secure. Secure the EMKey by running "emctl config emkey -remove_from_repos".Case 3: AdminServer is down or emkey is corrupted in CredStore) & (emkey exists in repos):
The EMKey exists in the Management Repository, but is not configured properly or is corrupted in the credential store.Configure the EMKey by running "emctl config emkey -copy_to_credstore".Case 4: (AdminServer is down or emkey is corrupted in CredStore) & (emkey does not exist in repos):
The EMKey is not configured properly or is corrupted in the credential store and does not exist in the Management Repository. To correct the problem:1) Get the backed up emkey.ora file.2) Configure the emkey by running "emctl config emkey -copy_to_credstore_from_file".
If not the key was not secured properly, we will have to be put in the repository before proceeding. Look at the next step 2 for doing this
There may be cases (like mine) where running emctl may give errors like the following:
$OMS_ORACLE_HOME/bin/emctl status emkey
Exception in thread “Main Thread” java.lang.NoClassDefFoundError: oracle/security/pki/OracleWallet
At oracle.sysman.emctl.config.oms.EMKeyCmds.main (EMKeyCmds.java:658)
Just move to the next step to put the key back in the repository
STEP 2: Put emkey.ora back in the repository
Skip this step if your emkey.ora is present in the repository. If not, you need to put the key back in the repository See if you can run the following command (with sample output):
$OMS_ORACLE_HOME/bin/emctl config emkey –copy_to_repos
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
The EMKey has been copied to the Management Repository. This operation will cause the EMKey to become unsecure.
After the required operation has been completed, secure the EMKey by running
"emctl config emkey -remove_from_repos".
Typically the key is present under $OMS_ORACLE_HOME/sysman/config directory before being removed after the install as a best practice. If you hit any errors while running emctl commands like the one mentioned in step 1, jump to step 3 and we will take care of the emkey.ora in Step 5
STEP 3: Get the port information
Check for the existing port information in the emd.properties file under EM_INSTANCE_DIRECTORY (typically gc_inst directory right above the Middleware home where you have deployed em. For eg. /u01/app/oracle/product/gc_inst in case your oms home is /u01/app/oracle/product/Middleware/oms11g)
In my case I got the information from the emgc.properties present in the gc_inst on the second node. If you can run emctl you may want to try the following command as well
$OMS_ORACLE_HOME/bin/emctl status oms –details
Note this information as this will be used in the next step
STEP 4: Perform cleanup on Node 1
Note the oracle home of the Weblogic and OMS, get the list of applied patches in the homes (using opatch lsinventory command), take a backup copy of the home just in case we need it and then de-install/remove oracle homes, update inventory and cleanup processes on the first node
STEP 5: Perform Software Only Installation of OMS on Node 1
Perform Weblogic 10.3.2 installation exactly under the same location as present in the earlier installation. Perform software only installation of the OMS using the following command. This will not run any configuration assistants and bypass all user interface validations
runInstaller –noconfig -validationaswarnings
Select the “Additional OMS” option while performing the installation. Provide the same path for OMS and Instance directories like the previous installation
Use the port information collected in Step 3 while performing the installation. Once the installation is complete run the allroot.sh script to complete the binary deployment
STEP 6: Apply one-off patches
At this point you can apply any patches to the OMS Oracle Home previously. You only need to run opatch to install the patch in the home and not required to run the SQLs
STEP 7: Copy EM key
This step is only required if you were not able to use emctl command to put the emkey back into the EM repository in STEP 2
Copy the emkey.ora file of the old installation you have under $OMS_ORACLE_HOME/sysman/config directory of the newly installed OMS
STEP 8: Configure Grid Control Domain
Run the following command to configure the EM domain and OMS. Note that you need to use a different GC Domain name than what you used earlier. For example I have used GCDOMAIN11 as the new domain name when my previous domain name was GCDOMAIN
$OMS_ORACLE_HOME/bin/omsca new –AS_USERNAME weblogic –EM_DOMAIN_NAME GCDOMAIN11 –NM_USER nodemanager -nostart
This command as shown below will prompt for a number of inputs like Admin Server hostname, port, password, etc. Verify if the defaults shown are correct by pressing enter or provide a new value
STEP 9: Run Add-ON Configuration Assistant
After this step run the following add-on configuration assistant. This was used in my case to configure the virtualization add-on
$OMS_ORACLE_HOME/addonca -oui -omsonly -name vt -install gc
STEP 10: Start the OMS
Now start the OMS using
$OMS_ORACLE_HOME/bin/emctl start oms
In a multi-node setup like mine you would either have a software load balancer or DNS round robin (using a virtual host name that resolves to one of multiple OMS hostnames) being used for load balancing. Secure the OMS against the SLB or DNS virtual hostname using the following
$ OMS_HOME/bin/emctl secure oms -host slb.example.com -secure_port 1159 -slb_port 1159 -slb_console_port 443
STEP 11: Configure the Agent
From the $AGENT_ORACLE_HOME/bin run the ./agentca –f
At this point you should have your OMS on node 1 fully re-covered. Clean up node 2 and use the normal Additional OMS installation process documented in the official installation guide to add the additional OMS on node 2
Summary
It took us nearly a little over two days to completely recover the environment with some other non-EM related issues that hit us along the way as well. In the end a situation like this could have been completely avoided had the proper housekeeping and backup of the Enterprise Manager Deployment been done in the first place. This is going to a topic that we cover in the next post. In the meantime please do refer to the Oracle Enterprise Manager Advanced Configuration Guide for planning your EM installation, backup and housekeeping procedures. This can be found here:
http://download.oracle.com/docs/cd/E11857_01/index.htm
Thanks
This post would not have been possible without Raj Aggarwal, Prasad Chebrolu and Ravikumar Basa who helped to recover the environment and provided all the support we needed