How to find out what is causing a slow down of the application on this server?
- by Jan P.
This is not the typical serverfault question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.
Situation
We have this web application that uses Zend Framework, so runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
The application has a very unique usage and load pattern. It is a mobile web application where every full hour a cronjob looks through the database for users that have some information waiting or action to do and sends this information to a (external) notification server, that pushes these notifications to them. After the users get these notifications, the go to the app and use it, mostly for a very short time. An hour later, same thing happens.
Problem
In the last few weeks usage of the application really started to grow. In the last few days we encountered very high load and doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests, it just gets slower and slower and often takes 20 minutes to recover - until the same thing starts again at the full hour.
We have extensive monitoring in place (New Relic, collectd) but I can't figure out what's wrong; I can't find the bottlekneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16 core Intel Xeon (8 cores with hyperthreading, I think) and 12GB RAM running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is Version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs
info
phpinfo()
apc status
memcache status
collectd
Processes
CPU
Apache
Load
MySQL
Vmem
Disk
New Relic
Application performance
Server overview
Processes
Network
Disks
(Sorry the graphs are gifs and not the same time period, but I think the most important info is in there)