Webserver horribly slow, sometimes incredibly fast
- by dhanke
I am running a small community (6000+ members) on a non-virtualized 64-bit Ubuntu 11.04 system.
I am not a Linux pro, not even advanced; I just tried to set up a webserver, which actually does nothing special. Its task is to deliver some dynamic PHP and Ruby on Rails websites. So my configuration files might look horribly bad, and I might use the wrong vocabulary, so when in doubt, please ask.
With a current all-time record of 520 registered users (board accounts, not system users) online at the same time, the average server load is about 2.0 - 5.0.
At normal times (~250 users) the average load is about 0.4 - 0.8, sometimes a bit higher during expensive searches. Everything fine.
From time to time, however, the load increases up to 120 (120.0, not 12.0 ;) ).
During these times it is hard to even connect via SSH, but when I do reach the server and use top/htop/iotop to see what is happening, I cannot identify any process causing high CPU load.
iotop reports a current read/write speed of approx. 70 kB/s, which is practically idle, I think.
Memory usage peaks at ~12 GB of 16 GB, so swap remains empty.
Now the odd part (at least for me):
After waiting some minutes (since I always panic a bit when this happens it feels like 5 minutes, but I suppose it is more like 20-30 minutes), the server is back to normal and everything continues as usual.
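To stop guessing at the duration, I have started taking timestamped load samples; a minimal sketch (log path and interval are arbitrary choices of mine, on the server I run it in a screen session):

```shell
# append a few timestamped load-average samples to a log file
# (three one-second samples here just to show the format; on the server
#  this runs every 60 s for as long as needed)
for i in 1 2 3; do
    echo "$(date '+%F %T') $(cat /proc/loadavg)" >> /tmp/loadavg.log
    sleep 1
done
tail -n 3 /tmp/loadavg.log
```

That way I can read off exactly when a spike started and how long it lasted.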
Another odd fact:
When I run hdparm -tT /dev/sda, I get an answer like:
/dev/sda:
Timing cached reads: 7180 MB in 2.00 seconds = 3591.13 MB/sec
Timing buffered disk reads: 348 MB in 3.02 seconds = 115.41 MB/sec
When I run the same command while the server is "frozen", the answer looks like this:
/dev/sda: <- takes about 5 minutes until this line appears
Timing cached reads: 7180 MB in 2.00 seconds = 3591.13 MB/sec <- 5 more minutes
Timing buffered disk reads: 348 MB in 3.02 seconds = 115.41 MB/sec <- another 5 minutes
So the values are the same, but the quoted times are completely wrong. Prefixing the command with time also tells me that ~15 minutes were spent.
I searched dmesg and /var/log/[messages|syslog] - nothing found.
/var/log/errors, however, tells me:
Jul 4 20:28:30 localhost kernel: [19080.671415] INFO: task php5-fpm:27728 blocked for more than 120 seconds.
Jul 4 20:28:30 localhost kernel: [19080.671419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
multiple times. Now, that message tells me the php5-fpm task was blocked for more than two minutes - but not whether that is the cause of the "freeze" or just one of its results. Anyone?
To cut a long story short: I don't even know where to start analyzing.
So if you can give me any advice by looking at the following specs and configs, or ask me to provide more information, I'd be glad.
Specs:
6-core AMD Phenom(tm) II X6 1055T processor
16 GB RAM
2x 1.5 TB Seagate ST1500DL003-9VT16L via SATA 3, in software RAID (I suppose)
Services (those marked [ + ] by service --status-all):
nginx Webserver 1.0.14
MySQL 5.1.63 server
Ruby on Rails 2.3.11 ( passenger-nginx-module )
php5-fpm 5.3.6-13ubuntu3.7
SSH
ido2db
Further services:
default crontab + nightly backup.
syslog-ng
The website consists of two subdomains, forum. and www., where forum is a phpBB 3.x PHP board and www a Ruby on Rails 2.3.11 application (the portal).
Mini-note: sometimes I notice that the forum is pretty slow, in contrast to the always-fast (except during this "freeze") portal. Both share the same database, but the portal uses it read-only.
The webserver is nginx, using the Phusion Passenger module to communicate with the Ruby application. For the forum it communicates with php5-fpm via a socket:
Relevant nginx configuration parts (my comments/questions start with ;):
; in case the freeze is caused by too much filesystem activity, maybe add a limit?
#worker_rlimit_nofile 50000;
user www-data;
; 6 cores, so I read 6 fits. Maybe already wrong?
worker_processes 6;
pid /var/run/nginx.pid;
events {
worker_connections 1024;
}
http {
passenger_root /var/lib/gems/1.8/gems/passenger-3.0.11;
passenger_ruby /usr/bin/ruby1.8;
; the forum once featured a chat, which worked without websockets,
; so it produced a hell of a lot of polling requests (deactivated now, freeze still happening)
keepalive_timeout 65;
keepalive_requests 50;
gzip on;
server {
listen 80;
server_name www.domain.tld;
root /var/www/domain/rails/public;
passenger_enabled on;
}
server {
listen 80;
server_name forum.domain.tld;
location / {
root /var/www/domain/forum;
index index.php;
}
; static files to be served by nginx directly
location ~* ^/style/.+\.(jpg|jpeg|gif|css|png|js|ico|xml)$ {
access_log off;
expires 30d;
root /var/www/domain/forum/;
}
; now the PHP magic, note the "backend" fastcgi_pass
location ~ \.php$ {
fastcgi_split_path_info ^(.+\.php)(.*)$;
fastcgi_pass backend;
fastcgi_index index.php;
fastcgi_param SCRIPT_FILENAME /var/www/domain/forum$fastcgi_script_name;
include fastcgi_params;
fastcgi_param QUERY_STRING $query_string;
fastcgi_param REQUEST_METHOD $request_method;
fastcgi_param CONTENT_TYPE $content_type;
fastcgi_param CONTENT_LENGTH $content_length;
fastcgi_intercept_errors on;
fastcgi_ignore_client_abort off;
fastcgi_connect_timeout 60;
fastcgi_send_timeout 180;
fastcgi_read_timeout 180;
fastcgi_buffer_size 128k;
fastcgi_buffers 256 16k;
fastcgi_busy_buffers_size 256k;
fastcgi_temp_file_write_size 256k;
fastcgi_max_temp_file_size 0;
}
location ~ /\.ht {
deny all;
}
}
; the php5-fpm socket. I read that /dev/shm/ would be the fastest place for this. Bad idea in general?
upstream backend {
server unix:/dev/shm/phpfpm;
}
...
}
php5-fpm settings (I raised these values higher and higher in response to php5-fpm error-log messages; the freeze problem existed before as well):
listen = /dev/shm/phpfpm
user = www-data
group = www-data
pm = dynamic
; holy, 4000! Well, shrinking this value to an earthly level gave me
; hundreds of 502 Bad Gateway errors. These values were quite stable.
; Since there are only max. 520 users online at a time, I don't get why I would
; need as many children as configured here. Due to keep-alive maybe?
; Asking questions is easier for me, since restarting the server makes
; my community members angry ;)
pm.max_children = 4000
pm.start_servers = 100
pm.min_spare_servers = 50
pm.max_spare_servers = 150
pm.max_requests = 10
pm.status_path = /status
ping.path = /ping
ping.response = pong
slowlog = log/$pool.log.slow
; should I use rlimit?
;rlimit_files = 1024
chdir = /
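Regarding the pm.max_children question above: a back-of-the-envelope check of what 4000 children could cost in memory (the ~30 MB average resident size per worker is only my guess from top, not a measured value):

```shell
# worst-case memory if every configured child is actually spawned
# (30 MB per php5-fpm worker is an assumed average, not measured)
awk 'BEGIN {
    children = 4000       # pm.max_children from the pool config above
    mb_per_child = 30     # assumed average RSS of one worker
    total = children * mb_per_child
    printf "worst case: %d MB (~%.1f GB)\n", total, total / 1024
}'
```

That prints worst case: 120000 MB (~117.2 GB), far more than the 16 GB fitted, so with pm = dynamic I seem to be relying on the spare-server limits to keep the real number low.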
mysql/my.cnf
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
[mysqld]
user = mysql
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
skip-external-locking
bind-address = 127.0.0.1
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
myisam-recover = BACKUP
; a high number, but anything less gives some phpBB errors.
max_connections = 450
table_cache = 512
; I read twice the number of CPU cores; bad?
thread_concurrency = 12
join_buffer_size = 2084K
concurrent_insert = 3
query_cache_limit = 64M
query_cache_size = 512M
query_cache_type = 1
log_error = /var/log/mysql/error.log
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 2
expire_logs_days = 10
max_binlog_size = 100M
low_priority_updates=1
[mysqldump]
quick
quote-names
max_allowed_packet = 16M
[isamchk]
key_buffer = 16M
!includedir /etc/mysql/conf.d/
I already ran smartctl; the HDDs seem to be fine.
/proc/mdstat says:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md3 : active raid1 sda3[1]
1459264192 blocks [2/1] [_U]
md1 : active raid1 sda1[0]
3911680 blocks [2/1] [U_]
unused devices: <none>
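If I read the [2/1] and [_U]/[U_] notation correctly (two members expected, one active), each mirror is currently running on a single disk; a quick awk check over the output above (the here-doc just replays my /proc/mdstat) seems to confirm it:

```shell
# flag arrays whose status line shows a missing mirror member
# (input is the /proc/mdstat excerpt quoted above)
awk '/blocks/ { if ($0 ~ /\[_U\]|\[U_\]/) print prev ": degraded (one mirror member missing)" }
     { if ($1 ~ /^md/) prev = $1 }' <<'EOF'
md3 : active raid1 sda3[1]
      1459264192 blocks [2/1] [_U]
md1 : active raid1 sda1[0]
      3911680 blocks [2/1] [U_]
EOF
```

On the real server, mdadm --detail /dev/md1 /dev/md3 would give the authoritative answer; I have not pasted that here.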
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127727
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 127727
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I put some questions into my configuration files above; they are not (intentionally) directly problem-related, but I would like to know whether those settings are indeed questionable or done right.
One additional fact: my MySQL database is 12 GB in size.
I don't know if that matters, but mytop sometimes shows me INSERT queries running 4-5 seconds, some even 20-30 seconds. It is just a feeling that I am unable to prove (because I don't know how), but when I disable the database, the freeze does not seem to happen.
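Since my.cnf above already enables the slow query log with long_query_time = 2, I tried pulling the worst query times out of it; a sketch of the filter (the two entries in the here-doc are invented samples in the MySQL 5.1 slow-log format, my real log is at /var/log/mysql/mysql-slow.log):

```shell
# print every query time above 10 s from a slow-log excerpt
# (the here-doc is a made-up two-entry sample; feed the real log instead)
awk '/^# Query_time:/ { if ($3 + 0 > 10) print "slow query:", $3, "seconds" }' <<'EOF'
# Query_time: 23.5  Lock_time: 0.1  Rows_sent: 0  Rows_examined: 0
INSERT INTO phpbb_posts (post_text) VALUES ('dummy');
# Query_time: 1.2  Lock_time: 0.0  Rows_sent: 1  Rows_examined: 1
SELECT 1;
EOF
```

mysqldumpslow could summarize the same file, but I wanted to see the raw numbers first.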
Example:
I created a dummy Rails application to watch the development log.
The app made some SQL queries, reads and inserts.
Quite often the log looked like this:
DbTest Load (0.3ms) SELECT * FROM `db_test` WHERE (`db_test`.`id` = 31722) LIMIT 1
SQL (0.1ms) BEGIN
DbTest Update (0.3ms) UPDATE `db_test` SET `updated_at` = '2012-07-04 23:32:34' WHERE `id` = 31722
- now the log stands still for 5-60 seconds.
SQL (49.1ms) COMMIT
- the SQL UPDATE time in the log does not include the freeze time
Rendering test/index
Completed in 96ms (View: 16, DB: 59) | 200 OK [http://localhost:9000/test]
The bad part: this mini-freeze also only happens from time to time.
Note: meanwhile I cannot even upload files via scp.
I currently feel like I am running from bad to worse and back while googling for my server problem, due to my immense lack of knowledge about server configuration.
It still makes me wonder why these problems appear at all, since 250 concurrent users is not such a high number, right?
So my questions:
What is wrong, and how do I fix it? ;) Or:
What information can I provide to make the situation clearer?
Can you point at a critically bad configuration line that I should read up on in the documentation?
Are there any tools I can run to find possible bottlenecks?
Any further advice? (Apart from "pay someone who knows what he is doing" - this is a private project, and the server costs enough already. :))
Thanks for your time and help.
Best Regards,
Daniel
P.S.: I replaced the real names with domain.tld in the config files, since I don't want even a bit more load on the server until this is fixed. Might be an exaggerated precaution...
P.P.S.: If this is a complete duplicate of an existing question, sorry; my search results all seemed quite specific in their own way.