How To Clear An Alert - Part 2
Posted by werner.de.gruyter on Oracle Blogs, Mon, 05 Apr 2010
There were some interesting comments and remarks on the original posting, so I decided to do a follow-up and address some of the issues that got raised...
Handling Metric Errors
First of all, there is a significant difference between an 'error' and an 'alert'.
- An 'alert' is the violation of a condition (a threshold) specified for a given metric. That means that the Agent is collecting and gathering the data for the metric, but there is a situation that requires the attention of an administrator.
- An 'error', on the other hand, is a failure to collect metric data: the Agent throws the error because it cannot determine the value for the metric.
Whereas the 'alert' guarantees continuity of the metric data, an 'error' signals a big unknown. And the unknown aspect of all this is what makes an error a lot more serious than a regular alert: If you don't know what the current state of affairs is, there could be some serious issues brewing that nobody is aware of...
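You can actually see this distinction in the Grid Control repository itself: open threshold violations and open collection errors are exposed through different views. As a minimal sketch (assuming the 10g repository views MGMT$ALERT_CURRENT and MGMT$METRIC_ERROR_CURRENT; view and column names can vary a bit between releases):

-- Open alerts: the metric IS being collected, but a threshold is violated
SELECT target_name, metric_name, metric_column, collection_timestamp, message
  FROM mgmt$alert_current
 ORDER BY collection_timestamp DESC;

-- Open metric errors: the Agent could NOT collect the metric at all
SELECT target_name, metric_name, collection_timestamp
  FROM mgmt$metric_error_current
 ORDER BY collection_timestamp DESC;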
The life-cycle of a Metric Error
Clearing a metric error follows pretty much the same workflow as clearing a metric 'alert':
- The Agent signals the error after it fails to execute the metric.
- The error is uploaded to the OMS/repository, where it becomes visible in the Console.
- The error will remain active until the Agent is able to execute the metric successfully. Even though the metric is still getting scheduled and executed on a regular basis, the error will remain outstanding as long as the Agent is not capable of executing the metric correctly.
Knowing this, the way to fix the metric error should be obvious: Take the 'problem' away, and as soon as the metric is executed again (based on the frequency of the metric), the error will go away.
The same tricks used to clear alerts can be used here too:
- Wait for the next scheduled execution. For those metrics that are executed regularly (like every 15 minutes or so), it's just a matter of waiting those minutes to see the updates.
- The 'Reevaluate Alert' button can be used to force a re-execution of the metric. In case a metric is executed once a day, this will be a better way to make sure that the underlying problem has been solved. And if it has been, the metric error will be removed, and the regular data points will be uploaded to the repository.
- And just in case you have to 'force' the issue a little: If you disable and re-enable a metric, it will get re-scheduled. And that means a new metric execution, and an update of the (hopefully) fixed problem.
Database server-generated alerts and problem checkers
There are various ways the Agent can collect metric data: via a script or a SQL statement, by reading a log file, by getting a value from an SNMP OID, by listening for SNMP traps, or via the DBMS_SERVER_ALERT mechanism of an Oracle database.
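As a side note: the thresholds the database itself evaluates for these server-generated alerts (they are set through the DBMS_SERVER_ALERT package, normally by Enterprise Manager) can be inspected in the DBA_THRESHOLDS view, for example:

-- Thresholds the database evaluates on its own (set via DBMS_SERVER_ALERT)
SELECT metrics_name, warning_value, critical_value, object_name
  FROM dba_thresholds
 ORDER BY metrics_name;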
For those alerts that are generated by the database (like the tablespace metrics for 10g and above databases), the Agent just 'waits' for the database to report any new findings. If the Agent has lost the current state of the server-side metrics (due to an incomplete recovery after a disaster, or after an improper use of the 'emctl clearstate' command), the Agent might still be aware of an alert that the database no longer has (or vice versa).
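If you suspect such a mismatch, a quick way to see what the database itself currently reports (independently of the Agent) is the DBA_OUTSTANDING_ALERTS view, and then compare that list against what the Console shows for the target:

-- What the database itself currently reports, independent of the Agent
SELECT object_type, object_name, reason, suggested_action, creation_time
  FROM dba_outstanding_alerts
 ORDER BY creation_time;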
The same goes for 'problem checker' alerts: those metrics that only report data if there is a problem (like the 'invalid objects' metric) will also have a problem if the Agent state has been tampered with (again, an incomplete recovery and improper use of 'emctl clearstate' are the two main causes for this).
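And for a problem-checker metric like 'invalid objects', you can always double-check on the database side whether the problem really still exists before blaming the Agent state; conceptually the check boils down to something like this (a sketch of the idea, not the exact query the metric uses):

-- Does the 'problem' actually still exist on the database side?
SELECT owner, object_type, object_name
  FROM dba_objects
 WHERE status = 'INVALID'
 ORDER BY owner, object_type, object_name;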
The best way to deal with these kinds of mismatches is to simply disable and re-enable the metric again: The disabling will clear the state of the metric, and the re-enabling will force a re-execution of the metric, so the new and updated results can get uploaded to the repository.
Starting with 10gR5, the Agent performs additional checks and verifications after each restart of the Agent and/or each state change of the database (shutdown/startup, or a failover in the case of Data Guard) to catch these kinds of mismatches.