How to find the reason for a weekly downtime on an Ubuntu web server hosted by AWS?
- by IceSheep
We started monitoring our web server with Pingdom and found that we have a few minutes of downtime every Sunday at 0:00 UTC.
The test runs every minute and checks whether a successful HTTP response (code 200) is returned on port 80. The test fails with a timeout (no response after 30 seconds).
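For reference, each check is roughly equivalent to the following curl call (the hostname is a placeholder):
# roughly what the Pingdom check does: plain HTTP on port 80,
# give up after 30 seconds, print only the status code
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 30 http://www.example.com/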
Here's what we've already checked – without success:
Since we run our web server behind a load balancer, I've set up the Pingdom test against both the load balancer's public DNS name and the web server's public DNS name, to find out whether the problem lies with the AWS load balancer – both tests return the same result.
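In other words, the manual equivalent of the two tests behaves the same during the outage (both hostnames below are placeholders):
# check the ELB and the instance directly; during the outage both time out
for host in my-elb-123456.us-east-1.elb.amazonaws.com ec2-203-0-113-10.compute-1.amazonaws.com; do
    printf '%s: ' "$host"
    curl -sS -o /dev/null -w '%{http_code}\n' --max-time 30 "http://$host/"
done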
We set up Munin on our web server. Everything looked fine, even right after a failure. Since the last failure lasted only 2 minutes, I suppose Munin couldn't capture the potential problem (it only samples every 5 minutes).
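If it would help, I could temporarily log some basic numbers at a finer resolution around the next Sunday 0:00 UTC; a rough sketch (file path and interval are arbitrary):
# log load, memory and socket summaries every 10 seconds
while true; do
    echo "--- $(date -u)" >> /tmp/sunday-watch.log
    uptime >> /tmp/sunday-watch.log
    free -m >> /tmp/sunday-watch.log
    ss -s >> /tmp/sunday-watch.log
    sleep 10
done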
I have checked /var/log/apache2/error.log and /var/log/syslog for suspicious entries.
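(e.g. grepping around the outage window; the date is just an example, and syslog and the Apache error log use slightly different timestamp formats)
# entries between 00:00 and 00:15 on Sunday, Sep 2
egrep 'Sep  2 00:(0[0-9]|1[0-5])' /var/log/syslog
egrep 'Sep 02 00:(0[0-9]|1[0-5])' /var/log/apache2/error.log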
I have checked /etc/cron.weekly and /etc/crontab for suspicious entries.
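(i.e. looking at where the weekly jobs are defined and what they contain)
# system-wide cron table and the weekly run-parts scripts
grep -v '^#' /etc/crontab
ls -l /etc/cron.weekly/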
I have searched for files created or last modified between 0:00 and 0:15 using this method:
touch -t 201209020000 start   # reference file with mtime 2012-09-02 00:00
touch -t 201209020015 end     # reference file with mtime 2012-09-02 00:15
find / -newer start -and ! -newer end   # files modified within that window
(nothing found)
Has anybody experienced a similar problem? Any proposals on how to find the reason for this behavior?
It's Ubuntu 10.04 LTS running on an AWS m1.large instance.
Thanks!