My setup:
I have 3 nearly identical webserver machines serving the same heavily loaded dynamic website, with simple load balancing over DNS. The service has been working for over two years with the same Apache config.
Apache 2, PHP 5, Ubuntu 8.04, Linux 2.6.24-29-server
My problem:
For about two weeks I have been experiencing problems with this config. Nearly every day there is a short period of about 5 minutes in which the website is unreachable. I'm still able to log in to the servers over ssh. If I run htop, I see the machine simply doing nothing: about 1000 apache processes are running, but there is no CPU activity.
I've used Apache's mod_status to debug this situation. The process scoreboard looks like this:
_C.___K_______________________R._______.__K_K____K___C_______.__
_______C__________.___________________________________.________C
_.____K__________K___K_WK_____._K_____________________________._
W______K__________K________.____________________._______C_______
_C_.__K__K____.._.._____________________________________C_______
_R___________K___.______C________.C_________.______._____C______
____________KKC____K_____K__WC_________________C_____.__.____.__
_____________________C_________K______.____C______._____________
_.___C____.___.___________________________.K______.____K________
W__.___________________C.__.____K________K_______R_._.__._______
__C__C_.__________C__C_______._____W______________C_.___C_______
____.______C_____________C________.____C____________.________._K
__.__________.K_____________K_________._____C____.K__________KW_
__K.W________R_________._______.___W___________.____.__K_____W__
W___.___..________W____K
Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process
So most of the processes are just waiting for a connection. After about 5 minutes the situation returns to normal: I have far fewer processes on every machine, most slots show the "." status (an open slot with no current process), and of course the website is reachable again!
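To put numbers on "most of the processes", the scoreboard can be tallied per state. This is only a small sketch: the function name `tally_scoreboard` is my own, not something mod_status provides, and the state names are copied from the scoreboard key above.

```python
from collections import Counter

# State names copied from the mod_status scoreboard key.
SCOREBOARD_KEY = {
    "_": "Waiting for Connection",
    "S": "Starting up",
    "R": "Reading Request",
    "W": "Sending Reply",
    "K": "Keepalive (read)",
    "D": "DNS Lookup",
    "C": "Closing connection",
    "L": "Logging",
    "G": "Gracefully finishing",
    "I": "Idle cleanup of worker",
    ".": "Open slot with no current process",
}


def tally_scoreboard(scoreboard):
    """Count slots per state; newlines and unknown characters are ignored."""
    counts = Counter(ch for ch in scoreboard if ch in SCOREBOARD_KEY)
    # most_common() orders the result from the most frequent state down.
    return {SCOREBOARD_KEY[ch]: n for ch, n in counts.most_common()}


def print_tally(scoreboard):
    """Print one line per state, most frequent first."""
    for state, n in tally_scoreboard(scoreboard).items():
        print(f"{n:5d}  {state}")
```

Pasting the scoreboard text from the status page into `print_tally(...)` gives one count per state, which makes it easier to compare a snapshot taken during the outage with one taken afterwards.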
I've been trying to find something in the logs, but there is simply nothing. The Apache access log is silent for about 4 minutes, and so is the error log. I also cannot find anything wrong in the other system logs.
The situation is the same on all 3 webservers (all of them show this peak in process count and become unresponsive at the same time), so I do not think this is hardware related.
But I think this might be related to some network (TCP) issue.
Any ideas?
EDIT:
Some more information that I have just discovered:
It has just happened again, and I was able to verify that I also cannot connect locally while the problem occurs.
I collected some connection statistics with the following command after it happened:
netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
109 CLOSE_WAIT
2652 ESTABLISHED
2 FIN_WAIT1
11 LAST_ACK
12 LISTEN
91 SYN_RECV
1 SYN_SENT
16 TIME_WAIT
If I execute the same command some time later, I get something like this:
4 CLOSING
108 ESTABLISHED
18 FIN_WAIT1
182 FIN_WAIT2
37 LAST_ACK
12 LISTEN
50 SYN_RECV
11276 TIME_WAIT
So in the normal situation I have only 100-200 open client connections being handled by Apache at any moment. When I have this "crash", I have a lot more connections. What is the best way to analyse this?
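To compare the two snapshots more easily, the same tally as the awk one-liner can be done in a script and the difference printed per state. This is just a sketch: `count_tcp_states` and `state_delta` are names I made up, and unlike awk it skips lines that have fewer than six fields.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally the 6th column of every `netstat -an` line that mentions tcp."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Same filter as: awk '/tcp/ {print $6}'
        if "tcp" in line and len(fields) >= 6:
            counts[fields[5]] += 1
    return counts


def state_delta(before, after):
    """Signed change per state between two tallies (after minus before)."""
    states = sorted(set(before) | set(after))
    return {s: after.get(s, 0) - before.get(s, 0) for s in states}


# The two snapshots from above: during the outage and some time later.
during = {"CLOSE_WAIT": 109, "ESTABLISHED": 2652, "FIN_WAIT1": 2,
          "LAST_ACK": 11, "LISTEN": 12, "SYN_RECV": 91,
          "SYN_SENT": 1, "TIME_WAIT": 16}
later = {"CLOSING": 4, "ESTABLISHED": 108, "FIN_WAIT1": 18,
         "FIN_WAIT2": 182, "LAST_ACK": 37, "LISTEN": 12,
         "SYN_RECV": 50, "TIME_WAIT": 11276}

for state, change in state_delta(during, later).items():
    print(f"{change:+7d}  {state}")
```

For live data, the output of `netstat -an` would have to be passed to `count_tcp_states` as a string, for example from a cron job that logs the tally with a timestamp, so that the minutes before an outage are captured too.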
EDIT2:
The important lines in apache2.conf are:
KeepAlive On
MaxKeepAliveRequests 20
KeepAliveTimeout 1
<IfModule mpm_prefork_module>
ServerLimit 920
StartServers 30
MinSpareServers 80
MaxSpareServers 120
MaxClients 920
MaxRequestsPerChild 700
</IfModule>
It is Apache 2 with the prefork MPM and mod_php.
The server has 8 GB RAM and a 4 GB swap partition.
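For reference, here is the rough arithmetic relating MaxClients to the RAM. It is only a sketch: I have not measured the real per-process memory use, so the 30 MiB figure in the usage lines is a made-up placeholder, and both function names are my own.

```python
def per_child_budget_mib(ram_gib, max_clients):
    """RAM per Apache child (MiB) if all max_clients children must fit in RAM."""
    return ram_gib * 1024 / max_clients


def max_clients_for(ram_gib, child_rss_mib, reserved_mib=0):
    """Largest number of children that fit in RAM for a given per-child size."""
    return int((ram_gib * 1024 - reserved_mib) // child_rss_mib)


# Values from the config above: 8 GB RAM, MaxClients 920.
print(round(per_child_budget_mib(8, 920), 1), "MiB per child at MaxClients 920")

# ASSUMPTION: 30 MiB per child is a placeholder, not a measured value.
print(max_clients_for(8, 30), "children would fit at 30 MiB each")
```

With 920 children and 8 GB of RAM, each process has a little under 9 MiB before the machine has to use swap, so the real per-child size of mod_php would need to be measured to know whether that limit is realistic.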