Recreating OMS instances in a HA environment when instances on all nodes are lost

Posted by rnigam on Oracle Blogs
Published on Sat, 25 Jun 2011 15:55:38 -0700 Indexed on 2011/06/26 0:27 UTC

Oracle highly recommends deploying EM in a HA environment. The best practices for HA deployments, backup and housekeeping of your Enterprise Manager environment are documented in the Oracle Enterprise Manager Advanced Configuration Guide. It is imperative that there is a good disaster recovery plan in place for your EM deployment. In this post I want to talk about a customer who failed to do the correct planning and housekeeping for EM and ended up in a situation where all the OMSes would have been lost had we not jumped in to help.


We recently hit an issue at a customer site where we had a two node OMS setup of the Enterprise Manager and a RAC Database being used as the EM repository. An accidental delete of the OMS oracle home left us with a single node deployment. While we were trying to figure out a possible path to recover the first node, the second node was rebooted under a maintenance window. What followed was a complete site outage as the Admin and managed servers would not start on either of the nodes.


In my situation there were:


- No backups of the Oracle Homes from any node


- No OMS Configuration snapshots (created using the “emctl exportconfig oms” command) and the instance home was completely lost on node 1 which also had the Admin Server


We did, however, have:


- A copy of the emkey.ora that I found under the OMS_ORACLE_HOME/ of the second node. (NOTE: it is bad practice to keep your emkey under the OMS Oracle home directory on the same server as the OMS. The backup of the emkey should be maintained on some other server. In this case, however, it was a lifesaver, since there were no backups.)


- The OMS Oracle home on the second node, but it was missing a number of files and a number of the remaining files had been changed. There had been several attempts to start the server by modifying various files based on the WebLogic Server logs, in order to have at least one node up and running, but all of them failed.


Here is how you can recover from this scenario:


STEP 1: Check status of emkey.ora


Check whether the emkey is present in the EM repository. Run the following command:


$OMS_ORACLE_HOME/bin/emctl status emkey


If the output looks like the following, the emkey is configured properly (see the cases listed after the sample output for what each message says about the repository):


./emctl status emkey


Oracle Enterprise Manager 11g Release 1 Grid Control


Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.


Enter Enterprise Manager Root (SYSMAN) Password :


The EMKey is configured properly.


Here are the messages that you might see as the emctl status emkey output depending upon whether the EM Admin Server is up and if the key is configured properly:


Case 1: AdminServer is up, emkey is proper in CredStore & not in repos. This is the same as the output of the command shown above:

The EMKey is configured properly


Case 2: AdminServer is up, emkey is proper in CredStore & exists in repos:

The EMKey is configured properly, but is not secure.
Secure the EMKey by running "emctl config emkey -remove_from_repos".


Case 3: (AdminServer is down or emkey is corrupted in CredStore) & (emkey exists in repos):


The EMKey exists in the Management Repository, but is not configured properly or is corrupted in the credential store.
Configure the EMKey by running "emctl config emkey -copy_to_credstore".

Case 4: (AdminServer is down or emkey is corrupted in CredStore) & (emkey does not exist in repos):


The EMKey is not configured properly or is corrupted in the credential store and does not exist in the Management Repository. To correct the problem:
1) Get the backed up emkey.ora file.
2) Configure the emkey by running "emctl config emkey -copy_to_credstore_from_file".


If the key is not present in the repository, it has to be put back there before proceeding. Step 2 describes how to do this.
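
The four cases above can be told apart mechanically, which is handy if you script the check. Here is a minimal sketch, assuming the messages match the text listed above word for word (they may differ in other releases). The function name classify_emkey_status and the labels it prints are my own and are not part of emctl.

```shell
#!/bin/sh
# classify_emkey_status (hypothetical helper, not an emctl feature):
# reads the output of "emctl status emkey" on stdin and prints a short
# label for the matching case from the list above.
classify_emkey_status() {
    output=$(cat)
    # The more specific messages are tested first, because the Case 2
    # message also contains the Case 1 text.
    case "$output" in
        *"but is not secure"*)
            echo "case2: key in credstore and in repository" ;;
        *"exists in the Management Repository"*)
            echo "case3: key in repository, credstore copy unusable" ;;
        *"does not exist in the Management Repository"*)
            echo "case4: key not in repository, backed up emkey.ora needed" ;;
        *"The EMKey is configured properly"*)
            echo "case1: key in credstore, not in repository" ;;
        *)
            echo "unknown: emctl output not recognized" ;;
    esac
}
```

One way to use it is to save the emctl output to a file first and then run classify_emkey_status < emkey_status.txt (the file name is only an example). Anything that lands in the "unknown" branch, such as a Java exception, should be treated as "emctl is not usable on this node".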


There may be cases (like mine) where running emctl may give errors like the following:


$OMS_ORACLE_HOME/bin/emctl status emkey


Exception in thread "Main Thread" java.lang.NoClassDefFoundError: oracle/security/pki/OracleWallet


        at oracle.sysman.emctl.config.oms.EMKeyCmds.main(EMKeyCmds.java:658)


Just move to the next step to put the key back in the repository


STEP 2: Put emkey.ora back in the repository


Skip this step if your emkey.ora is present in the repository. If not, you need to put the key back in the repository. See if you can run the following command (shown with sample output):


$OMS_ORACLE_HOME/bin/emctl config emkey -copy_to_repos


Oracle Enterprise Manager 11g Release 1 Grid Control


Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.


The EMKey has been copied to the Management Repository. This operation will cause the EMKey to become unsecure.


After the required operation has been completed, secure the EMKey by running


"emctl config emkey -remove_from_repos".


Typically the key is present under the $OMS_ORACLE_HOME/sysman/config directory until it is removed after the install as a best practice. If you hit errors while running emctl commands, like the one shown in Step 1, jump to Step 3; we will take care of the emkey.ora in Step 7.


STEP 3: Get the port information


Check for the existing port information in the emd.properties file under EM_INSTANCE_DIRECTORY (typically the gc_inst directory next to the Middleware home where you deployed EM; for example, /u01/app/oracle/product/gc_inst if your OMS home is /u01/app/oracle/product/Middleware/oms11g).


In my case I got the information from the emgc.properties file present in the gc_inst directory on the second node. If you can run emctl, you may want to try the following command as well:


$OMS_ORACLE_HOME/bin/emctl status oms -details


Note this information as this will be used in the next step
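
If emctl is not usable, the port entries can be pulled out of the properties file directly. Here is a minimal sketch, assuming the file uses plain key=value lines and that the port settings have "port" somewhere in the key name. The function name list_port_settings is my own; check the result against the file by eye, since I am not assuming any particular key names.

```shell
#!/bin/sh
# list_port_settings (hypothetical helper): print every non-comment line
# of a properties file that mentions "port", sorted for easy comparison.
# $1: path to the properties file (for example the emgc.properties found
#     under the gc_inst directory).
list_port_settings() {
    [ -r "$1" ] || { echo "cannot read: $1" >&2; return 1; }
    # Drop comment lines first, then keep the lines that mention a port.
    grep -v '^[[:space:]]*#' "$1" | grep -i 'port' | sort
}
```

Redirecting the output to a file outside the Oracle homes, for example list_port_settings /path/to/emgc.properties > /tmp/em_ports.txt, keeps the values at hand for the installer screens later on (both paths here are placeholders).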


STEP 4: Perform cleanup on Node 1


Note the Oracle homes of WebLogic and the OMS, and get the list of patches applied to each home (using the opatch lsinventory command). Take a backup copy of each home in case it is needed later, then de-install or remove the Oracle homes, update the inventory, and clean up any leftover processes on the first node.
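
The "record the patches and keep a copy" part of this step can be scripted so nothing is forgotten before the homes are removed. Here is a minimal sketch, assuming opatch lives under OPatch/ in each home and that a tar archive is an acceptable backup format. The function names, the backup directory and the output file names are my own choices.

```shell
#!/bin/sh
# save_patch_list (hypothetical helper): write the opatch inventory of
# one Oracle home to a text file in the backup directory.
# $1: Oracle home, $2: existing backup directory
save_patch_list() {
    "$1/OPatch/opatch" lsinventory -oh "$1" \
        > "$2/$(basename "$1").lsinventory.txt"
}

# backup_home (hypothetical helper): archive one Oracle home into the
# backup directory and print the path of the archive that was created.
# $1: Oracle home, $2: existing backup directory
backup_home() {
    src=$1
    dest_dir=$2
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "not a directory: $dest_dir" >&2; return 1; }
    name=$(basename "$src")
    archive="$dest_dir/$name.tar.gz"
    # -C keeps the paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
    echo "$archive"
}
```

For example, save_patch_list "$OMS_ORACLE_HOME" /backup/em followed by backup_home "$OMS_ORACLE_HOME" /backup/em, where /backup/em is a placeholder for a location outside the homes being removed. The de-install and inventory update are left as manual steps here.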


STEP 5: Perform Software Only Installation of OMS on Node 1


Install WebLogic 10.3.2 in exactly the same location as in the earlier installation. Then perform a software-only installation of the OMS using the following command. This does not run any configuration assistants and bypasses all user interface validations:


runInstaller -noconfig -validationaswarnings


Select the “Additional OMS” option while performing the installation. Provide the same paths for the OMS and Instance directories as in the previous installation.


Use the port information collected in Step 3 while performing the installation. Once the installation is complete, run the allroot.sh script to complete the binary deployment.


STEP 6: Apply one-off patches


At this point you can re-apply any one-off patches that were previously applied to the OMS Oracle home. You only need to run opatch to install each patch in the home; you are not required to run the SQL scripts.


STEP 7: Copy EM key


This step is only required if you were not able to use emctl command to put the emkey back into the EM repository in STEP 2


Copy the emkey.ora file that you have from the old installation to the $OMS_ORACLE_HOME/sysman/config directory of the newly installed OMS.
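
The copy itself is a one-liner, but a couple of checks around it help catch a wrong path before you move on to configuring the domain. Here is a minimal sketch; the function name install_emkey is my own, and I am assuming the saved key should land as emkey.ora with its original permissions and timestamps preserved.

```shell
#!/bin/sh
# install_emkey (hypothetical helper): copy a saved emkey.ora into the
# sysman/config directory of a freshly installed OMS home.
# $1: path to the saved emkey.ora, $2: new OMS Oracle home
install_emkey() {
    src=$1
    dest_dir="$2/sysman/config"
    [ -f "$src" ] || { echo "emkey file not found: $src" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "missing directory: $dest_dir" >&2; return 1; }
    # -p preserves the mode and timestamps of the saved file.
    cp -p "$src" "$dest_dir/emkey.ora"
}
```

Usage would look like install_emkey /backup/em/emkey.ora "$OMS_ORACLE_HOME", where /backup/em/emkey.ora stands in for wherever you kept the key from the old installation.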


STEP 8: Configure Grid Control Domain


Run the following command to configure the EM domain and OMS. Note that you need to use a different GC Domain name from the one you used earlier. For example, I used GCDOMAIN11 as the new domain name when my previous domain name was GCDOMAIN.


$OMS_ORACLE_HOME/bin/omsca new -AS_USERNAME weblogic -EM_DOMAIN_NAME GCDOMAIN11 -NM_USER nodemanager -nostart


This command will prompt for a number of inputs such as the Admin Server hostname, port and password. For each prompt, press Enter to accept the default if it is correct, or provide a new value.


STEP 9: Run Add-ON Configuration Assistant


After this step run the following add-on configuration assistant. This was used in my case to configure the virtualization add-on


$OMS_ORACLE_HOME/addonca -oui -omsonly -name vt -install gc


STEP 10: Start the OMS


Now start the OMS using


$OMS_ORACLE_HOME/bin/emctl start oms


In a multi-node setup like mine you would either have a software load balancer or DNS round robin (using a virtual host name that resolves to one of multiple OMS hostnames) being used for load balancing. Secure the OMS against the SLB or DNS virtual hostname using the following


$OMS_ORACLE_HOME/bin/emctl secure oms -host slb.example.com -secure_port 1159 -slb_port 1159 -slb_console_port 443


STEP 11: Configure the Agent


From $AGENT_ORACLE_HOME/bin, run ./agentca -f


At this point you should have your OMS on node 1 fully recovered. Clean up node 2 and use the normal Additional OMS installation process documented in the official installation guide to add the additional OMS on node 2.


Summary


It took us a little over two days to completely recover the environment, with some other non-EM issues hitting us along the way as well. In the end, a situation like this could have been avoided completely had the proper housekeeping and backup of the Enterprise Manager deployment been done in the first place. This is going to be a topic that we cover in the next post. In the meantime, please refer to the Oracle Enterprise Manager Advanced Configuration Guide for planning your EM installation, backup and housekeeping procedures. It can be found here:


http://download.oracle.com/docs/cd/E11857_01/index.htm


Thanks


This post would not have been possible without Raj Aggarwal, Prasad Chebrolu and Ravikumar Basa, who helped to recover the environment and provided all the support we needed.

© Oracle Blogs or respective owner
