What's going on with my server? High load, lots of idle CPU time, low disk utilization
- by Jonathan
I run a web site and send a legitimate opt-in, daily email newsletter to subscribers. Both the web hosting and email sending are done by the same machine.
I have about 100,000 subscribers who have opted in to my daily email newsletter. My PHP script did a pretty good job sending mail to all of them until fairly recently, but as the list has grown I can't keep up.
When I run top, I have very high load--usually at least 6 or 7, sometimes as high as 15--even though I only have two CPUs. However, when I run sar, my CPU is idle an average of about 30% of the time. So, it seems I'm not CPU bound. When I run iostat, it seems as though I'm not disk bound because my %util for each device is very low (no more than 5%).
Given that I don't seem to be CPU bound or disk bound, why is top reporting such high load?
Additionally, since I don't seem to be CPU bound or disk bound, why is my email sending script not able to keep up?
Here's what I see when running top:
top - 11:33:28 up 74 days, 18:49, 2 users, load average: 7.65, 8.79, 8.28
Tasks: 168 total, 5 running, 162 sleeping, 0 stopped, 1 zombie
Cpu(s): 38.9%us, 58.6%sy, 0.8%ni, 0.0%id, 0.7%wa, 0.2%hi, 0.8%si, 0.0%st
Mem: 3083012k total, 2144436k used, 938576k free, 281136k buffers
Swap: 2048248k total, 39164k used, 2009084k free, 1470412k cached
Here's what I see when running iostat -mx:
avg-cpu: %user %nice %system %iowait %steal %idle
34.80 1.20 55.24 0.37 0.00 8.38
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.19 71.70 1.59 29.45 0.02 0.07 5.90 0.55 17.82 1.16 3.59
sda1 0.00 0.00 0.00 0.00 0.00 0.00 7.10 0.00 13.80 13.72 0.00
sda2 0.05 50.45 1.13 24.57 0.01 0.29 24.25 0.35 13.43 1.15 2.97
sda3 0.05 10.17 0.20 2.33 0.01 0.05 43.75 0.05 20.96 2.45 0.62
sda4 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 70.50 70.50 0.00
sda5 0.07 0.22 0.03 0.07 0.00 0.00 32.84 0.08 856.19 8.03 0.08
sda6 0.02 5.45 0.03 0.72 0.00 0.02 67.55 0.02 26.72 5.26 0.39
sda7 0.00 1.56 0.00 0.42 0.00 0.01 38.04 0.00 8.88 5.84 0.24
sda8 0.01 3.84 0.20 1.35 0.00 0.02 28.55 0.05 31.90 4.08 0.63
Here's what I see when running sar:
09:40:02 AM CPU %user %nice %system %iowait %steal %idle
09:50:01 AM all 30.59 1.01 49.80 0.23 0.00 18.37
10:00:08 AM all 31.73 0.92 51.66 0.13 0.00 15.55
10:10:06 AM all 30.43 0.99 48.94 0.26 0.00 19.38
10:20:01 AM all 29.58 1.00 47.76 0.25 0.00 21.42
10:30:01 AM all 29.37 1.02 47.30 0.18 0.00 22.13
10:40:06 AM all 32.50 1.01 52.94 0.16 0.00 13.39
10:50:01 AM all 30.49 1.00 49.59 0.15 0.00 18.77
11:00:01 AM all 29.43 0.99 47.71 0.17 0.00 21.71
11:10:07 AM all 30.26 0.93 49.48 0.83 0.00 18.50
11:20:02 AM all 29.83 0.81 48.51 1.32 0.00 19.52
11:30:06 AM all 31.18 0.88 51.33 1.15 0.00 15.47
Average: all 26.21 1.15 42.62 0.48 0.00 29.54
Here are the top handful of processes listed at the particular time I happened to run top -c:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8180 mysql 16 0 57448 19m 2948 S 26.6 0.7 4702:26 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --pid-file=/var/lib/mysql/bristno.pid --skip-external-locking
26956 brristno 17 0 0 0 0 Z 8.0 0.0 0:00.24 [php] <defunct>
26958 brristno 17 0 94408 43m 37m R 5.0 1.4 0:00.15 /usr/bin/php /home/brristno/public_html/dbv.php
22852 nobody 16 0 9628 2900 1524 S 0.7 0.1 0:00.17 /usr/local/apache/bin/httpd -k start -DSSL
8591 brristno 34 19 96896 13m 6652 S 0.3 0.4 0:29.82 /usr/local/bin/php /home/brristno/bin/mailer.php 1qwqyb6 i0gbor
24469 nobody 16 0 9628 2880 1508 S 0.3 0.1 0:00.08 /usr/local/apache/bin/httpd -k start -DSSL
25495 nobody 15 0 9628 2876 1500 S 0.3 0.1 0:00.06 /usr/local/apache/bin/httpd -k start -DSSL
26149 nobody 15 0 9628 2864 1504 S 0.3 0.1 0:00.04 /usr/local/apache/bin/httpd -k start -DSSL