Webserver horribly slow, sometimes incredibly fast

Posted by dhanke on Server Fault
Published on 2012-07-05T01:36:54Z Indexed on 2012/07/05 3:17 UTC

I am running a small community (6000+ members) on a non-virtual 64-bit Ubuntu 11.04 system.

I am not a Linux pro, not even advanced; I just tried to set up a webserver that does nothing special: its task is to deliver some dynamic PHP and RoR websites. So my configuration files may well look horribly bad. I might also use the wrong vocabulary, so if in doubt, please ask.

At our current all-time record of 520 registered users (board accounts, not system users) online at the same time, the average server load is about 2.0 - 5.0. Under normal traffic (~250 users) the average load is about 0.4 - 0.8, sometimes a bit higher during expensive searches. So far, everything is fine.

From time to time, however, the load climbs to 120 (120.0, not 12.0 ;) ). During that time it is hard to even connect via SSH, and when I do reach the server and use top/htop/iotop to see what is happening, I cannot identify any process causing high CPU load.

iotop reports a current read/write rate of roughly 70 KB/s, which is practically idle, I think.
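
On Linux the load average counts not only runnable tasks but also tasks in uninterruptible sleep (state D, usually waiting on I/O), which is one way a very high load can coexist with idle CPUs. A minimal sketch, assuming a Linux /proc filesystem, that lists such tasks; the function names are made up for illustration and are not part of any tool mentioned in the post:

```python
import os

def parse_stat_line(stat_line):
    """Return (pid, command, state) from the content of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split around the first "(" and the last ")".
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    command = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, command, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            blocked.append((pid, command))
    return blocked
```

Calling uninterruptible_tasks() a few times during a freeze and printing the result would show whether the load consists of blocked tasks rather than busy ones.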

Memory usage peaks at ~12 GB of 16 GB, so swap remains empty.

Now the odd part (at least for me):

After waiting some minutes (since I always panic a bit when this happens, it feels like 5 minutes, but I suppose it is more like 20-30 minutes), the server is back to normal and everything continues as usual.

Another odd fact:

When I run hdparm -tT /dev/sda, I get output like this:

/dev/sda:
  Timing cached reads:   7180 MB in  2.00 seconds = 3591.13 MB/sec
  Timing buffered disk reads: 348 MB in  3.02 seconds = 115.41 MB/sec

When I run the same command while the server is "frozen", the output looks like this:

/dev/sda:  <- takes about 5 minutes until this line appears
  Timing cached reads:   7180 MB in  2.00 seconds = 3591.13 MB/sec <- 5 more minutes
  Timing buffered disk reads: 348 MB in  3.02 seconds = 115.41 MB/sec <- another 5 minutes

So the values are the same, but the reported time is completely wrong. Prefixing the command with time also tells me that ~15 minutes were used.
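
The gap between the seconds a command reports and the minutes that actually pass can be made explicit by recording wall-clock time and CPU time separately. A rough sketch using only the Python standard library on a Unix system; run_and_time is a hypothetical helper name, not an existing tool:

```python
import resource
import subprocess
import time

def run_and_time(argv):
    """Run a command and return (wall_seconds, cpu_seconds) for it."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    started = time.monotonic()
    subprocess.run(argv, check=False)
    wall = time.monotonic() - started
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # CPU time = user time + system time consumed by the child process.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return wall, cpu
```

For example, run_and_time(["hdparm", "-tT", "/dev/sda"]) returns both numbers; a wall time far above the CPU time means the command spent most of its life waiting rather than working.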

I searched dmesg and /var/log/[messages|syslog]: nothing found.

/var/log/errors, however, tells me this:

Jul  4 20:28:30 localhost kernel: [19080.671415] INFO: task php5-fpm:27728 blocked for more than 120 seconds.
Jul  4 20:28:30 localhost kernel: [19080.671419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

This appears multiple times. The message tells me that a php5-fpm task was blocked (or did the blocking?), but not whether that is the cause or just one of the results of the "freeze". Anyone?
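
To see which tasks the kernel reports as blocked, and how often, the messages can be tallied per task name. A small sketch that assumes the log lines look exactly like the two quoted above; the regular expression and the function name are illustrative only:

```python
import re
from collections import Counter

# Matches the kernel's hung-task report, e.g.
# "INFO: task php5-fpm:27728 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def count_hung_tasks(lines):
    """Count hung-task reports per task name."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

Passing an open file handle for /var/log/errors to count_hung_tasks would show whether only php5-fpm is affected or whether other tasks (for instance the database) are blocked as well.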

To cut a long story short, I don't even know where to start analyzing. So if you can give me any advice by looking at the following specs and configs, or ask me to provide more information, I'd be glad.

Specs:

    6-core AMD Phenom(tm) II X6 1055T processor
    16 GB RAM
    2x 1.5 TB Seagate ST1500DL003-9VT16L on SATA 3, in software RAID (I suppose)

Services (according to service --status-all, those marked [ + ]):

    nginx Webserver 1.0.14
    MySQL 5.1.63 server
    Ruby on Rails 2.3.11 ( passenger-nginx-module )
    php5-fpm 5.3.6-13ubuntu3.7 
    SSH
    ido2db


Further services:

     default crontab + nightly backup.
     syslog-ng

The website consists of two subdomains, forum. and www.: forum is a phpBB 3.x board (PHP), and www is a Ruby on Rails 2.3.11 application (the portal).

Mini-note: sometimes I notice that the forum is pretty slow, in contrast to the always-fast (except during this "freeze") portal. Both share the same database, but the portal uses it read-only.

The webserver is nginx, using the Phusion Passenger module to communicate with the Ruby application. For the forum, it communicates with php5-fpm via a socket:

Relevant nginx configuration parts (my comments/questions start with ; ):

; in case of freeze due to too high Filesystem activity, maybe adding a limit?
#worker_rlimit_nofile 50000;
user  www-data;
; 6 cores, so i read 6 fits. maybe already wrong?
worker_processes  6;  
pid /var/run/nginx.pid;
events {
        worker_connections  1024;
}


http {
        passenger_root /var/lib/gems/1.8/gems/passenger-3.0.11;
        passenger_ruby /usr/bin/ruby1.8;

; the forum once featured a chat that worked without websockets,
; so it generated a flood of polling requests (deactivated now, freeze still happening)
        keepalive_timeout  65;
        keepalive_requests 50;
        gzip  on;

        server {
                listen 80;
                server_name www.domain.tld;
                root /var/www/domain/rails/public;
                passenger_enabled on;
        }

        server {
                listen     80;
                server_name  forum.domain.tld;

                location / {
                        root   /var/www/domain/forum;
                        index  index.php;
                }
; static stuff to be handled by nginx
                location ~* ^/style/.+.(jpg|jpeg|gif|css|png|js|ico|xml)$ {
                        access_log              off;
                        expires            30d;
                        root /var/www/domain/forum/;
                }

; now the php magic, note the "backend"-fcgi_pass
                location ~ .php$ {
                        fastcgi_split_path_info ^(.+\.php)(.*)$;
                        fastcgi_pass   backend;
                        fastcgi_index  index.php;
                        fastcgi_param  SCRIPT_FILENAME  /var/www/domain/forum$fastcgi_script_name;
                        include fastcgi_params;
                        fastcgi_param  QUERY_STRING      $query_string;
                        fastcgi_param  REQUEST_METHOD   $request_method;
                        fastcgi_param  CONTENT_TYPE      $content_type;
                        fastcgi_param  CONTENT_LENGTH   $content_length;
                        fastcgi_intercept_errors                on;
                        fastcgi_ignore_client_abort      off;
                        fastcgi_connect_timeout 60;
                        fastcgi_send_timeout 180;
                        fastcgi_read_timeout 180;
                        fastcgi_buffer_size 128k;
                        fastcgi_buffers 256 16k;
                        fastcgi_busy_buffers_size 256k;
                        fastcgi_temp_file_write_size 256k;
                        fastcgi_max_temp_file_size 0;
                }

                location ~ /\.ht {
                        deny  all;
                }

        }

; the php5-fpm socket. I read that /dev/shm/ would be the fastest place for this. bad idea in general?
        upstream backend {
                server unix:/dev/shm/phpfpm;
        }
       ...
}

php5-fpm settings (I raised these values higher and higher in response to php5-fpm error log messages; the freeze problem was there before as well):


listen = /dev/shm/phpfpm 
user = www-data
group = www-data
pm = dynamic


; holy, 4000! well, shrinking this value to earth level gave me
; hundreds of 502 Bad Gateway errors. these values were quite stable.
; since there are at most 520 users online, I don't get why I would need
; as many children as configured here. due to keep-alive maybe?
; asking questions is easier for me, since restarting the server will make
; my community members angry ;)
pm.max_children      = 4000 
pm.start_servers     = 100
pm.min_spare_servers = 50 
pm.max_spare_servers = 150 
pm.max_requests      = 10

pm.status_path = /status
ping.path = /ping
ping.response = pong
slowlog = log/$pool.log.slow

;should i use rlimit?
;rlimit_files = 1024

chdir = /
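
Some back-of-the-envelope arithmetic for pm.max_children: every child is a separate PHP process, so the setting is usually bounded by RAM rather than by the number of users online. A sketch using a purely hypothetical 30 MB per child (the real figure would have to be measured on the server, for example from the resident size shown in top) and the 16 GB from the specs:

```python
def worst_case_memory_mb(max_children, per_child_mb):
    """Memory needed if every configured child were actually spawned."""
    return max_children * per_child_mb

def max_children_for_budget(budget_mb, per_child_mb):
    """How many PHP-FPM children fit into a given memory budget."""
    return budget_mb // per_child_mb

PER_CHILD_MB = 30          # ASSUMPTION: hypothetical value, not measured
TOTAL_RAM_MB = 16 * 1024   # 16 GB of RAM, from the specs

# All 4000 children at once would need far more than the installed RAM.
print(worst_case_memory_mb(4000, PER_CHILD_MB))                   # 120000
# Children that would fit if half of the RAM were reserved for PHP.
print(max_children_for_budget(TOTAL_RAM_MB // 2, PER_CHILD_MB))   # 273
```

The numbers only illustrate the calculation; with a different per-child size the result changes proportionally.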

mysql/my.cnf

[client]
port        = 3306
socket      = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket      = /var/run/mysqld/mysqld.sock
nice        = 0

[mysqld]
user        = mysql
socket      = /var/run/mysqld/mysqld.sock
port        = 3306
basedir     = /usr
datadir     = /var/lib/mysql
tmpdir      = /tmp
skip-external-locking
bind-address        = 127.0.0.1
key_buffer      = 16M
max_allowed_packet  = 16M
thread_stack        = 192K
thread_cache_size       = 8
myisam-recover         = BACKUP

; high number, but lower values give some phpBB errors.
max_connections        = 450
table_cache            = 512

; I read it should be twice the number of CPU cores, bad?
thread_concurrency     = 12 
join_buffer_size       = 2084K
concurrent_insert      = 3
query_cache_limit   = 64M
query_cache_size        = 512M
query_cache_type    = 1

log_error                = /var/log/mysql/error.log
log_slow_queries    = /var/log/mysql/mysql-slow.log
long_query_time = 2
expire_logs_days    = 10
max_binlog_size         = 100M
low_priority_updates=1

[mysqldump]
quick
quote-names
max_allowed_packet  = 16M

[isamchk]
key_buffer      = 16M
!includedir /etc/mysql/conf.d/
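
A rough estimate of how much memory these MySQL settings could claim, using only the values listed above. This is a sketch under assumptions: it ignores per-connection buffers that are not set in this file (they fall back to server defaults), so the result is a lower bound, and the helper names are made up:

```python
def to_kb(value):
    """Convert a my.cnf size such as '16M' or '2084K' to kilobytes."""
    units = {"K": 1, "M": 1024, "G": 1024 * 1024}
    return int(value[:-1]) * units[value[-1].upper()]

def estimate_kb(global_buffers, per_connection_buffers, max_connections):
    """Shared buffers plus per-connection buffers for every allowed connection."""
    shared = sum(to_kb(v) for v in global_buffers.values())
    per_connection = sum(to_kb(v) for v in per_connection_buffers.values())
    return shared + max_connections * per_connection

# Values copied from the my.cnf above.
GLOBAL = {"key_buffer": "16M", "query_cache_size": "512M"}
PER_CONNECTION = {"join_buffer_size": "2084K", "thread_stack": "192K"}

total_kb = estimate_kb(GLOBAL, PER_CONNECTION, 450)
print("about %.0f MB" % (total_kb / 1024))  # about 1528 MB
```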

I already ran smartctl, and the HDDs seem to be fine. /proc/mdstat shows:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md3 : active raid1 sda3[1]
      1459264192 blocks [2/1] [_U]

md1 : active raid1 sda1[0]
      3911680 blocks [2/1] [U_]

unused devices: <none>

ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127727
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127727
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
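
Note that ulimit -a shows the limits of the current shell; a running daemon can have different ones, which Linux exposes in /proc/<pid>/limits. A small sketch, assuming the usual layout of that file, that extracts the open-files limit so it can be compared with settings such as worker_connections; the function name is hypothetical:

```python
def max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits content."""
    def convert(value):
        # Limits are numbers or the literal word "unlimited".
        return int(value) if value.isdigit() else value

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return convert(soft), convert(hard)
    return None
```

Reading /proc/<pid>/limits for one nginx worker and one php5-fpm child (the pid has to be looked up first) and passing the text to max_open_files would show what those processes are really allowed to open.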

I put some questions into my configuration files. They are not (intentionally) directly related to the problem, but it would be nice to know whether those settings are indeed questionable or done right.

One additional fact: my MySQL database is 12 GB in size.

I don't know if it matters, but mytop sometimes shows me insert queries that take 4-5 seconds, and some take 20-30 seconds. It is just a feeling that I am unable to prove (because I don't know how), but when I disable the database, the freeze does not seem to happen.
Example:

I created a dummy Rails application to watch the development log. The app made some SQL queries, both reads and inserts.

Quite often the log looked like this:

 DbTest Load (0.3ms)   SELECT * FROM `db_test` WHERE (`db_test`.`id` = 31722) LIMIT 1
 SQL (0.1ms)   BEGIN
 DbTest Update (0.3ms)   UPDATE `db_test` SET `updated_at` = '2012-07-04 23:32:34' WHERE `id` = 31722

 - now the log stands still for 5-60 seconds.

 SQL (49.1ms)   COMMIT

 - the SQL time reported in the log does not include the freeze time

Rendering test/index
Completed in 96ms (View: 16, DB: 59) | 200 OK [http://localhost:9000/test]

The bad part is: this mini-freeze only happens from time to time as well. Note: while it lasts, I cannot even upload files via scp.
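
One hypothesis (an assumption, not something the log proves) is that the pause occurs while data is being flushed to disk. A sketch that measures how long small synchronous writes take in a given directory, using only the Python standard library; fsync_latencies is a made-up name:

```python
import os
import tempfile
import time

def fsync_latencies(directory, rounds=5, payload=b"x" * 4096):
    """Write a small block and fsync it `rounds` times; return seconds per round."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(rounds):
            started = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # block until the data has reached the disk
            latencies.append(time.monotonic() - started)
    finally:
        os.close(fd)
        os.unlink(path)  # remove the temporary probe file again
    return latencies
```

Comparing the values from a normal period with those taken during a mini-freeze, on the filesystem that holds the database, would indicate whether synchronous writes are what stalls.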

I currently feel like I am going from bad to worse and back by googling for my server problem, due to an immense lack of knowledge about server configuration. It still makes me wonder why these problems even appear, since 250 users at a time is not such a high number, right?

So my questions:

  • What is wrong and how can I fix it? ;) Or:

  • What information can I provide to make the situation clearer?

  • Can you point at some critically bad configuration line that I should read up on in the documentation?
  • Are there any tools I can run to find possible bottlenecks?
  • Any further advice? (Other than "pay someone who knows what he is doing": it is a private project, and the server costs enough already. :))

Thanks for your time and help.

Best Regards, Daniel

P.S.: I renamed the domain in the config files to domain.tld since I don't want any additional load on the server until this is fixed. That might be an exaggerated worry..

P.P.S.: If I asked a complete duplicate question, sorry. My search results all seemed to be quite specific in their own way.

© Server Fault or respective owner
