I have a custom built Ubuntu 11.04 server with a 6 disk software RAID 10 primary drive. On it I'm primarily running a PostgreSQL and a few other utilities that stream data from the web. I often find after a few hours of uptime the server starts to lag with all kinds of processes. For example, it may take 10-15 seconds after log-in to get a shell prompt. It might take 5-10 seconds for top to come up. An ls might take a second or two.
When I look at top there is almost no CPU usage. There's a fair amount of memory used by the PostgreSQL server but not enough to bleed into
swap.
I have no idea where to go from here, other than to suspect the RAID10 (I've only ever had software RAID 1's before).
Edit: Output from top:
top - 11:56:03 up 1:46, 3 users, load average: 0.89, 0.73, 0.72
Tasks: 119 total, 1 running, 118 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 93.5%id, 6.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16325596k total, 3478248k used, 12847348k free, 20880k buffers
Swap: 19534176k total, 0k used, 19534176k free, 3041992k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1747 woodsp 20 0 109m 10m 4888 S 1 0.1 0:42.70 python
357 root 20 0 0 0 0 S 0 0.0 0:00.40 jbd2/sda3-8
1 root 20 0 24324 2284 1344 S 0 0.0 0:00.84 init
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:00.24 ksoftirqd/0
6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
7 root RT 0 0 0 0 S 0 0.0 0:00.01 watchdog/0
8 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1
10 root 20 0 0 0 0 S 0 0.0 0:00.02 ksoftirqd/1
12 root RT 0 0 0 0 S 0 0.0 0:00.01 watchdog/1
13 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/2
14 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/2:0
15 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/2
16 root RT 0 0 0 0 S 0 0.0 0:00.01 watchdog/2
17 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/3
18 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/3:0
19 root 20 0 0 0 0 S 0 0.0 0:00.02 ksoftirqd/3
20 root RT 0 0 0 0 S 0 0.0 0:00.01 watchdog/3
21 root 0 -20 0 0 0 S 0 0.0 0:00.00 cpuset
22 root 0 -20 0 0 0 S 0 0.0 0:00.00 khelper
23 root 20 0 0 0 0 S 0 0.0 0:00.00 kdevtmpfs
24 root 0 -20 0 0 0 S 0 0.0 0:00.00 netns
26 root 20 0 0 0 0 S 0 0.0 0:00.00 sync_supers
df -h
rpsharp@ncp-skookum:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 1.8T 549G 1.2T 32% /
udev 7.8G 4.0K 7.8G 1% /dev
tmpfs 3.2G 492K 3.2G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 7.8G 0 7.8G 0% /run/shm
/dev/sda2 952M 128K 952M 1% /boot/efi
/dev/md0 5.5T 562G 4.7T 11% /usr/local
free -m
psharp@ncp-skookum:~$ free -m
total used free shared buffers cached
Mem: 15942 3409 12533 0 20 2983
-/+ buffers/cache: 405 15537
Swap: 19076 0 19076
tail -50 /var/log/syslog
Jul 3 06:31:32 ncp-skookum rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="1070" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jul 3 06:39:01 ncp-skookum CRON[14211]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
Jul 3 06:40:01 ncp-skookum CRON[14223]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
Jul 3 07:00:01 ncp-skookum CRON[14328]: (woodsp) CMD (/home/woodsp/bin/mail_tweetupdate # email an update)
Jul 3 07:00:01 ncp-skookum CRON[14327]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
Jul 3 07:00:28 ncp-skookum sendmail[14356]: q63E0SoZ014356: from=woodsp, size=2328, class=0, nrcpts=2, msgid=<
[email protected]>, relay=woodsp@localhost
Jul 3 07:00:29 ncp-skookum sm-mta[14357]: q63E0Si6014357: from=<
[email protected]>, size=2569, class=0, nrcpts=2, msgid=<
[email protected]>, proto=ESMTP, daemon=MTA-v4, relay=localhost [127.0.0.1]
Jul 3 07:00:29 ncp-skookum sendmail[14356]: q63E0SoZ014356: to=Spencer Wood <
[email protected]>,Martin Lacayo <
[email protected]>, ctladdr=woodsp (1004/1005), delay=00:00:01, xdelay=00:00:01, mailer=relay, pri=62328, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (q63E0Si6014357 Message accepted for delivery)
Jul 3 07:00:29 ncp-skookum sm-mta[14359]: STARTTLS=client, relay=mx3.stanford.edu., version=TLSv1/SSLv3, verify=FAIL, cipher=DHE-RSA-AES256-SHA, bits=256/256
Jul 3 07:00:29 ncp-skookum sm-mta[14359]: q63E0Si6014357: to=<
[email protected]>,<
[email protected]>, ctladdr=<
[email protected]> (1004/1005), delay=00:00:01, xdelay=00:00:00, mailer=esmtp, pri=152569, relay=mx3.stanford.edu. [171.67.219.73], dsn=2.0.0, stat=Sent (Ok: queued as 8F3505802AC)
Jul 3 07:09:08 ncp-skookum CRON[14396]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
Jul 3 07:17:01 ncp-skookum CRON[14438]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jul 3 07:20:01 ncp-skookum CRON[14453]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
Jul 3 07:39:01 ncp-skookum CRON[14551]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
Jul 3 07:40:01 ncp-skookum CRON[14562]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
Jul 3 08:00:01 ncp-skookum CRON[14668]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
Jul 3 08:09:01 ncp-skookum CRON[14724]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
Jul 3 08:17:01 ncp-skookum CRON[14766]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jul 3 08:20:01 ncp-skookum CRON[14781]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
Jul 3 08:39:01 ncp-skookum CRON[14881]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
Jul 3 08:40:01 ncp-skookum CRON[14892]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
Output of hdparm -t /dev/sd{a,b,c,d,e,f} This looks suspicious?
/dev/sda:
Timing buffered disk reads: 2 MB in 4.84 seconds = 423.39 kB/sec
/dev/sdb:
Timing buffered disk reads: 420 MB in 3.01 seconds = 139.74 MB/sec
/dev/sdc:
Timing buffered disk reads: 390 MB in 3.00 seconds = 129.87 MB/sec
/dev/sdd:
Timing buffered disk reads: 416 MB in 3.00 seconds = 138.51 MB/sec
/dev/sde:
Timing buffered disk reads: 422 MB in 3.00 seconds = 140.50 MB/sec
/dev/sdf:
Timing buffered disk reads: 416 MB in 3.01 seconds = 138.26 MB/sec