The Nagios event handler is not triggered when the service is slow to respond or is down.
My configuration is below.
nagios.cfg
enable_event_handlers=1
localhost.cfg
define service {
use generic-service
host_name Server
service_description test-server
servicegroups test-service
check_command check-service
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 2
retry_check_interval 2
contact_groups testcontacts
notification_period 24x7
notification_options w,u,c,r
notifications_enabled 1
event_handler_enabled 1
event_handler recheck-service
}
command.cfg
define command{
command_name recheck-service
command_line /usr/local/nagios/libexec/alert.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
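With this command definition, the three macros should arrive in the script as the positional arguments $1 (state), $2 (state type) and $3 (attempt number). The following is a minimal sketch of the decision the handler is meant to make, assuming a hypothetical helper named decide_action that only prints what would happen instead of doing it. It is not part of Nagios or of my actual script, and the HARD branch is my assumption based on the script's comments.

```shell
#!/bin/sh
# Sketch only: decide_action is a hypothetical helper, not a Nagios function.
# $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$, $3 = $SERVICEATTEMPT$
decide_action() {
    case "$1" in
        CRITICAL)
            case "$2" in
                SOFT)
                    case "$3" in
                        # Act only on the 3rd soft check, before the state turns hard
                        3) echo "act" ;;
                        *) echo "wait" ;;
                    esac
                    ;;
                # Assumed: a hard critical state gets one last attempt
                HARD) echo "act" ;;
                *) echo "ignore" ;;
            esac
            ;;
        # OK, WARNING and UNKNOWN are ignored
        *) echo "ignore" ;;
    esac
}

# Simulate the arguments Nagios would pass
decide_action CRITICAL SOFT 3
decide_action CRITICAL SOFT 1
decide_action OK HARD 1
```

Calling it by hand with different argument combinations is a quick way to check that the case labels match the exact strings Nagios passes.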
alert.sh file
#!/bin/sh
set -x
case "$1" in
OK)
# The service just came back up, so don't do anything...
;;
WARNING)
# We don't really care about warning states, since the service is probably still running...
;;
UNKNOWN)
# We don't know what might be causing an unknown error, so don't do anything...
;;
CRITICAL)
# Aha! The HTTP service appears to have a problem - perhaps we should restart the server...
# Is this a "soft" or a "hard" state?
case "$2" in
# We're in a "soft" state, meaning that Nagios is in the middle of retrying the
# check before it turns into a "hard" state and contacts get notified...
SOFT)
# What check attempt are we on? We don't want to restart the web server on the first
# check, because it may just be a fluke!
case "$3" in
# Wait until the check has been tried 3 times before restarting the web server.
# If the check fails on the 4th time (after we restart the web server), the state
# type will turn to "hard" and contacts will be notified of the problem.
# Hopefully this will restart the web server successfully, so the 4th check will
# result in a "soft" recovery. If that happens no one gets notified because we
# fixed the problem!
3)
echo -n "Going To Ping the Virtual Machine (3rd soft critical state)..."
# Measure the response time with check_http and mail the result
myresult=`/usr/local/nagios/libexec/check_http xyz.com -t 100 | grep 'time'| awk '{print $10}'`
echo "Your Service Is taking the following time Delay" "$myresult Seconds" | mail -s "WARNING : Service Taken More Time To Response" [email protected]
;;
esac
;;
# The HTTP service somehow managed to turn into a hard error without getting fixed.
# It should have been restarted by the code above, but for some reason it didn't.
# Let's give it one last try, shall we?
# Note: Contacts have already been notified of a problem with the service at this