The Nagios event handler is not triggered when
the service takes too long to respond or is down.
My configuration is below.
nagios.cfg
enable_event_handlers=1
localhost.cfg
define service {
use generic-service
host_name Server
service_description test-server
servicegroups test-service
check_command check-service
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 2
retry_check_interval 2
contact_groups testcontacts
notification_period 24x7
notification_options w,u,c,r
notifications_enabled 1
event_handler_enabled 1
event_handler recheck-service
}
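For reference, this is what I understand the handler should receive with max_check_attempts 4. It is a minimal sketch, assuming the service goes from OK to CRITICAL and stays there, so attempts 1 to 3 are SOFT and attempt 4 is HARD. The function name handler_args is hypothetical; it only prints the values Nagios would substitute for $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$.

```shell
#!/bin/sh
# Hypothetical helper: print the argument triples the event handler would
# receive for a service that stays CRITICAL, given max_check_attempts.
handler_args() {
    max_attempts=$1
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if [ "$attempt" -lt "$max_attempts" ]; then
            # Retries before the last attempt are SOFT states
            echo "CRITICAL SOFT $attempt"
        else
            # The final attempt turns the state HARD
            echo "CRITICAL HARD $attempt"
        fi
        attempt=$((attempt + 1))
    done
}

handler_args 4
```

If that understanding is right, the third line ("CRITICAL SOFT 3") is the one that should reach the 3) branch of the script.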
command.cfg
define command{
command_name recheck-service
command_line /usr/local/nagios/libexec/alert.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
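To check whether Nagios runs the command at all, something like the following could go at the top of the handler script. It is only a sketch: the function name log_handler_call and the log path /tmp/alert_handler.log are assumptions, and the nagios user must be able to write to that file.

```shell
#!/bin/sh
# Hypothetical debug helper: append one line per handler invocation so it is
# easy to see whether the event handler is being called and with which arguments.
log_handler_call() {
    logfile=$1
    shift
    echo "$(date '+%Y-%m-%d %H:%M:%S') args: $*" >> "$logfile"
}

# Assumed log location; override by setting HANDLER_LOG in the environment.
HANDLER_LOG="${HANDLER_LOG:-/tmp/alert_handler.log}"
log_handler_call "$HANDLER_LOG" "$@"
```

An empty log after a forced CRITICAL state would suggest the handler is never invoked; a log with entries would point at the script body instead.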
alert.sh file
#!/bin/sh
set -x
case "$1" in
OK)
# The service just came back up, so don't do anything...
;;
WARNING)
# We don't really care about warning states, since
# the service is probably still running...
;;
UNKNOWN)
# We don't know what might be causing an
# unknown error, so don't do anything...
;;
CRITICAL)
# Aha! The HTTP service appears to have a problem - perhaps we should restart the server...
# Is this a "soft" or a "hard" state?
case "$2" in
# We're in a "soft" state, meaning that Nagios is in the middle of retrying the
# check before it turns into a "hard" state and contacts get notified...
SOFT)
# What check attempt are we on? We don't want to restart the web server on the first
# check, because it may just be a fluke!
case "$3" in
# Wait until the check has been tried 3 times before restarting the web server.
# If the check fails on the 4th time (after we restart the web server), the state
# type will turn to "hard" and contacts will be notified of the problem.
# Hopefully this will restart the web server successfully, so the 4th check will
# result in a "soft" recovery. If that happens no one gets notified because we
# fixed the problem!
3)
echo -n "Going To Ping the Virtual Machine (3rd soft critical state)..."
# Call the init script to restart the HTTPD server
myresult=`/usr/local/nagios/libexec/check_http xyz.com -t 100 | grep 'time'| awk '{print $10}'`
echo "Your Service Is taking the following time Delay" "$myresult Seconds" |mail -s "WARNING : Service Taken More Time To Response" [email protected]
;;
esac
;;
# The HTTP service somehow managed to turn into a hard error without getting fixed.
# It should have been restarted by the code above, but for some reason it didn't.
# Let's give it one last try, shall we?
# Note: Contacts have already been notified of a problem with the service at this