How to find the reason for a weekly downtime on an Ubuntu web server hosted by AWS?
Posted by IceSheep on Server Fault
Published on 2012-09-03T14:20:28Z
We started monitoring our web server with Pingdom and found that it is unreachable for a few minutes every Sunday at 0:00 UTC.
The test runs every minute and checks if a successful HTTP response (code 200) is returned on port 80. The test fails due to a timeout (no response after 30 seconds).
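For reference, the check is roughly equivalent to the sketch below. It is not Pingdom's actual implementation; the function names `classify` and `probe`, the URL, and the log path are placeholders made up for illustration. It assumes `curl` is installed.

```shell
#!/bin/sh
# Sketch of a single HTTP probe, similar in spirit to the Pingdom test.
# The URL and log path used in the example at the bottom are placeholders.

# classify CODE: print OK for HTTP 200, FAIL for anything else
# (curl reports 000 when it gets no HTTP response at all, e.g. on a timeout).
classify() {
    if [ "$1" = "200" ]; then
        echo OK
    else
        echo FAIL
    fi
}

# probe URL: one request with a 30-second limit,
# printed as "UTC-timestamp status-code verdict".
probe() {
    # -s: silent, -o /dev/null: discard the body,
    # -w '%{http_code}': print only the status code,
    # --max-time 30: give up after 30 seconds, like the monitoring test.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$1")
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $code $(classify "$code")"
}

# Example (hypothetical URL and log file), e.g. from a per-minute cron entry:
# probe http://localhost/ >> /tmp/http-probe.log
```

Running something like this on the server itself against `localhost` might help show whether Apache stops answering locally during the outage or whether only external traffic is affected.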
Here's what we've already checked – without success:
Since the web server runs behind a load balancer, I pointed one Pingdom test at the load balancer's public DNS name and another at the web server's public DNS name to find out whether the AWS load balancer is the problem. Both tests return the same result.
We set up Munin on the web server. Everything looked fine, even after the failure. Since the last failure lasted only 2 minutes, I suppose Munin could not capture a potential problem (it only samples every 5 minutes).
I have checked /var/log/apache2/error.log and /var/log/syslog for suspicious entries
I have checked /etc/cron.weekly and /etc/crontab for suspicious entries
I have searched for files created or last modified between 0:00 and 0:15 using this method:
touch -t 201209020000 start
touch -t 201209020015 end
find / -newer start -and ! -newer end
(nothing found)
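A slightly more defensive variant of the same search is sketched below. The function name `files_in_window` is made up for illustration. The sketch assumes GNU `find` and `touch`, and it makes two things explicit: `touch -t` interprets its timestamp in the local time zone, so `TZ=UTC` is set in case the server clock is not on UTC, and `-xdev` keeps `find` off `/proc`, `/sys`, and other mounts.

```shell
#!/bin/sh
# Sketch: list regular files whose LAST modification falls inside a time window.
# Usage: files_in_window DIR START END
# START and END use the touch -t format (CCYYMMDDhhmm) and are read as UTC.
files_in_window() {
    dir=$1
    start=$2
    end=$3
    # Keep the two marker files in a private temporary directory.
    tmp=$(mktemp -d) || return 1
    # TZ=UTC makes touch interpret the timestamps as UTC,
    # matching the times reported by the monitoring service.
    TZ=UTC touch -t "$start" "$tmp/start"
    TZ=UTC touch -t "$end" "$tmp/end"
    # -xdev: do not descend into other filesystems.
    # ! -path: leave the marker files themselves out of the result.
    find "$dir" -xdev -type f -newer "$tmp/start" ! -newer "$tmp/end" \
        ! -path "$tmp/*" 2>/dev/null
    rm -rf "$tmp"
}

# Example for the outage window from the question (needs root to see everything):
# files_in_window / 201209020000 201209020015
```

Note that this compares modification times only, so a file that was written during the window and then written again later (a busy log file, for example) will not show up. An empty result therefore does not prove that nothing happened in the window.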
Has anybody experienced a similar problem? Any proposals on how to find the reason for this behavior?
It's Ubuntu 10.04 LTS running on an AWS m1.large instance.
Thanks!
© Server Fault or respective owner