PHP-FPM stops responding and dies [migrated]
- by user12361
I'm running Drupal 6 with Nginx 1.5.1 and PHP-FPM (PHP 5.3.26) on a 1GB single core VPS with 3GB of swap space on SSD storage. I just switched from shared hosting to this unmanaged VPS because my site was getting too heavy, so I'm still learning the ropes. I have moderately high traffic, I don't really monitor it closely but Google Adsense usually record close to 30K page views/day. I usually have 50 to 80 authenticated users logged in and a few hundred more anonymous users hitting the Boost static HTML cache at any given moment.
The problem I'm having is that PHP-FPM frequently stops responding, resulting in Nginx 502 or 504 errors. I swear I have read every page on the internet about this issue, which seems fairly common, and I've tried endless combinations of configurations, and I can't find a good solution. After restarting Nginx and PHP-FPM, the site runs really fast for a while, and then without warning it simply stops responding. I get a white screen while the browser waits on the server, and after about 30 seconds to a minute it throws an Nginx 502 or 504 error. Sometimes it runs well for 2 minutes, sometimes 5 minutes, sometimes 5 hours, but it always ends up hanging. When I find the server in this state, there is still plenty of free memory (500MB or more) and no major CPU usage, the control and worker PHP-FPM processes are still present, and the server is still pingable and usable via SSH. A reload of PHP-FPM via the init script revives it again. The hangups don't seem to correspond to the amount of traffic, because I observed this behavior consistently when I was testing this configuration on a development VPS with no traffic at all.
I've been constantly tweaking the settings, but I can't definitively eliminate the problem. I set Nginx workers to just 1. In the PHP-FPM config I have tried all three of the process managers. "Dynamic" is definitely the least reliable, consistently hanging up after only a few minutes. "Static" also has been unreliable and unpredictable. The least buggy has been "ondemand", but even that is failing me, sometimes after as much as 12 to 24 hours. But I can't leave the server unattended because PHP-FPM dies and never comes back on its own. I tried adjusting the pm.max_children value from as low as 3 to as high as 50, doesn't make a lot of difference, but I currently have it at 10. Same thing for the spare servers values. I also have set pm.max_requests anywhere from 30 to unlimited, and it doesn't seem to make a difference.
According to the logs, the PHP-FPM processes are not exiting with SIGSEGV or SIGBUS, but rather with SIGTERM. I get a lot of lines like:
WARNING: [pool www] child 3739, script '/var/www/drupal6/index.php' (request: "GET /index.php") execution timed out (38.739494 sec), terminating
and:
WARNING: [pool www] child 3738 exited on signal 15 (SIGTERM) after 50.004380 seconds from start
I actually found several articles that recommend doing a graceful reload of PHP-FPM via cron every few minutes or hours to circumvent this issue. So that's what I did, "/etc/init.d/php-fpm reload" every 5 minutes. So far, it's keeping the lights on. But it feels like a dreadful hack. Is PHP-FPM really that unreliable? Is there anything else I can do? Thanks a lot!