iowait - Developer IT

What iowait values are ok?

- by alfish

I am trying to find the bottleneck of a server running a fairly busy php/mysql site. My first culprit was io but iostat shows that on average iowait consumes only %3.60 of cpu time. here is the complete result of issuing iostat: avg-cpu: %user %nice %system %iowait %steal %idle 65.78 0.00 8.52 3.60 0.00 22.10 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 42.36 138.28 1754.70 408630278 5185216128 So I am wondering if the iowait is within acceptable range, and if not, whether switching from SATA to SSD would dramatically reduce it?

Read the article

Ubuntu lucid, maverick high iowait

- by netom

I'm using Ubuntu, and I've got the same problem with Lucid and Maverick. From time to time, especially a few minutes after boot, the iowait goes between 50-100% and the box is unusable. Everything that tries to access the disk freezes. I have the following setup: Hard disk: Model Family: Western Digital Caviar Green family Device Model: WDC WD15EADS-00P8B0 Serial Number: WD-WMAVU0391287 Firmware Version: 01.00A01 User Capacity: 1.500.301.910.016 bytes I have a quad core Intel Core2 Q6600 processor, and 4G of memory. When the high iowait occurs, usually 4 processes are active: kdmflush (two procs) jbd2/dm-0-8 jbd2/db-1-8 and a few more starving user processes of course. I know this from top and iotop. Any suggestions about why this is happening? There are a lot of q/a-s about linux and high iowait, but none of them helped so far, I even tweaked the hard disk not to park the head in every 8 seconds (Load cycle count is 50334!!!!! :o ), but nothing. Problem persists. Thank you in advance.

Read the article

How to manage iowait over cifs?

- by Silvia

For backup purposes we have Cifs file Server running that contains encrypted containers for backing up the more sensitive data. The container is mounted with cryptsetup and loop as a local filesystem and the rsync is used for backups. Because the Cifs server is not the fastest machine ever built, running the rsync process results in an iowait on the servers running the backup which in turn drives Nagios into an email frenzy. The question is, how do reduce the iowait on the server? Configuring Nagios to not report seems more like a workaround then a solution. Stretching the backups over different time intervals is already done with little effect and spending money is also not an option because apparently, we are talking about a "non-critical system".

Read the article

no disk io but iowait very high

- by Dan

there is no disk io going results of iotop Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO< COMMAND 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init [3] 1930 be/4 named 0.00 B/s 0.00 B/s 0.00 % 0.00 % named -u ~d/run-root 1931 be/4 named 0.00 B/s 0.00 B/s 0.00 % 0.00 % named -u ~d/run-root 1932 be/4 named 0.00 B/s 0.00 B/s 0.00 % 0.00 % named -u ~d/run-root 1933 be/4 named 0.00 B/s 0.00 B/s 0.00 % 0.00 % named -u ~d/run-root 1810 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % sh /usr/b~user=mysql 9795 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd 8004 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd 3226 be/4 postfix 0.00 B/s 0.00 B/s 0.00 % 0.00 % tlsmgr -l -t unix -u 8154 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd 9759 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % find -name php.ini 9249 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd 1756 be/4 postfix 0.00 B/s 0.00 B/s 0.00 % 0.00 % psa-pc-re~@localhost 1863 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 0.00 % mysqld --~mysql.sock 3123 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % crond 1758 be/4 postfix 0.00 B/s 0.00 B/s 0.00 % 0.00 % psa-pc-re~@localhost 1865 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 0.00 % mysqld --~mysql.sock 1592 be/4 sw-cp-se 0.00 B/s 0.00 B/s 0.00 % 0.00 % sw-cp-ser~ver/config 7612 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % sshd: root@pts/0 7614 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % sftp-server 7615 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % -bash 1602 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % sshd 8003 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd but iowait very high ? iostat report avg-cpu: %user %nice %system %iowait %steal %idle 0.83 0.00 0.13 13.83 0.00 85.20 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn server runs like a snail what could be wrong here ? thanks

Read the article

100% iowait + drive faults in dmesg

- by w00t

Hi, I have a server on which resides a fairly visited web app. It has a raid1 of 2 HDDs, 64MB Buffer, 7200 RPM. Today it started throwing out errors like: kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) kernel: ata2.00: status: { DRDY } kernel: ata2: hard resetting link kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) kernel: ata2.00: max_sectors limited to 256 for NCQ kernel: ata2.00: max_sectors limited to 256 for NCQ kernel: ata2.00: configured for UDMA/133 kernel: sd 1:0:0:0: timing out command, waited 7s kernel: ata2: EH complete kernel: SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB) kernel: sda: Write Protect is off kernel: SCSI device sda: drive cache: write back All day long it has been in load higher than 10-15. I am monitoring it with atop and it gives some bizarre readings: DSK | sda | busy 100% | read 2 | write 208 | KiB/r 16 | KiB/w 32 | MBr/s 0.00 | MBw/s 0.65 | avq 86.17 | avio 47.6 ms | DSK | sdb | busy 1% | read 10 | write 117 | KiB/r 17 | KiB/w 5 | MBr/s 0.02 | MBw/s 0.07 | avq 4.86 | avio 1.04 ms | I frankly don't understand why only sda is taking all the hit. I do have one process that is constantly writing with 1-2megs but what the hell.. 100% iowait?

Read the article

High IOWait executing JBoss 3.2.7

- by user64205

Server Details: Kernel: Linux wiq31 2.4.21-9.ELsmp #1 SMP Thu Jan 8 17:08:56 EST 2004 i686 i686 i386 GNU/Linux CPU: 4 x Intel(R) Xeon(TM) CPU 3.06GHz Memory: 1028520 kB JBoss version: 3.2.7 Every time i try to start JBoss, in all CPU's, the iowait values starts to raise and the idle values starts to fall. Before executing my JBoss application, the free command returns the following output: *total used free shared buffers cached Mem: 1028520 966400 62120 0 187756 538928 -/+ buffers/cache: 239716 788804 Swap: 2044072 790672 1253400* After starting my JBoss application, the free command returns the following output: *total used free shared buffers cached Mem: 1028520 1007648 20872 0 187116 524084 -/+ buffers/cache: 296448 732072 Swap: 2044072 819096 1224976* After starting my JBoss application, without answering any request, the java process /proc/PID/status file have the following values: State: S (sleeping) SleepAVG: 27% Tgid: 24022 Pid: 24022 PPid: 21011 TracerPid: 0 Uid: 500 500 500 500 Gid: 500 500 500 500 FDSize: 256 Groups: 500 VmSize: 775200 kB VmLck: 0 kB VmRSS: 156752 kB VmData: 696752 kB VmStk: 36 kB VmExe: 21 kB VmLib: 710375 kB StaBrk: 0804f000 kB Brk: 095bb000 kB StaStk: bffff8c0 kB ExecLim: ffffffff Threads: 62 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000000000 SigCgt: 1000000180015ccf CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 Is this behavior being caused by memory swapping, or the short memory available in the server is enough to run my application?

Read the article

Why load increase ,ssd iops increase but cpu iowait decrease?

- by mq44944

There is a strange thing on my server which has a mysql running on it. The QPS is more than 4000 but TPS is less than 20. The server load is more than 80 and cpu usr is more than 86% but iowait is less than 8%. The disk iops is more than 16000 and util of disk is more than 99%. When the QPS decreases, the load decreases, the cpu iowait increases. I can't catch this! root@mypc # dmidecode | grep "Product Name" Product Name: PowerEdge R510 Product Name: 084YMW root@mypc # megacli -PDList -aALL |grep "Inquiry Data" Inquiry Data: SEAGATE ST3600057SS ES656SL316PT Inquiry Data: SEAGATE ST3600057SS ES656SL30THV Inquiry Data: ATA INTEL SSDSA2CW300362CVPR201602A6300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR2044037K300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR204402PX300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR204403WN300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR202000HU300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR202001E7300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR204402WE300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR204404E5300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR204401QF300EGN Inquiry Data: ATA INTEL SSDSA2CW300362CVPR20450001300EGN the mysql data files lie on the ssd disks which are organizaed using RAID 10. root@mypc # megacli -LDInfo -L1 -a0 Adapter 0 -- Virtual Drive Information: Virtual Disk: 1 (Target Id: 1) Name: RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0 Size:1427840MB State: Optimal Stripe Size: 64kB Number Of Drives:2 Span Depth:5 Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU Access Policy: Read/Write Disk Cache Policy: Disk's Default Exit Code: 0x00 -------- -----load-avg---- ---cpu-usage--- ---swap--- -------------------------io-usage----------------------- -QPS- -TPS- -Hit%- time | 1m 5m 15m |usr sys idl iow| si so| r/s w/s rkB/s wkB/s queue await svctm %util| ins upd del sel iud| lor hit| 09:05:29|79.80 64.49 42.00| 82 7 6 5| 0 0|16421.1 10.6262705.9 85.2 8.3 0.5 0.1 99.5| 0 0 0 3968 0| 495482 96.58| 09:05:30|79.80 64.49 42.00| 79 7 8 6| 0 0|15907.4 230.6254409.7 6357.5 8.4 0.5 0.1 98.5| 0 0 0 4195 0| 496434 96.68| 09:05:31|81.34 65.07 42.31| 81 7 7 5| 0 0|16198.7 8.6259029.2 99.8 8.1 0.5 0.1 99.3| 0 0 0 4220 0| 508983 96.70| 09:05:32|81.34 65.07 42.31| 82 7 5 5| 0 0|16746.6 8.7267853.3 92.4 8.5 0.5 0.1 99.4| 0 0 0 4084 0| 503834 96.54| 09:05:33|81.34 65.07 42.31| 81 7 6 5| 0 0|16498.7 9.6263856.8 92.3 8.0 0.5 0.1 99.3| 0 0 0 4030 0| 507051 96.60| 09:05:34|81.34 65.07 42.31| 80 8 7 6| 0 0|16328.4 11.5261101.6 95.8 8.1 0.5 0.1 98.3| 0 0 0 4119 0| 504409 96.63| 09:05:35|81.31 65.33 42.52| 82 7 6 5| 0 0|16374.0 8.7261921.9 92.5 8.1 0.5 0.1 99.7| 0 0 0 4127 0| 507279 96.66| 09:05:36|81.31 65.33 42.52| 81 8 6 5| 0 0|16496.2 8.6263832.0 84.5 8.5 0.5 0.1 99.2| 0 0 0 4100 0| 505054 96.59| 09:05:37|81.31 65.33 42.52| 82 8 6 4| 0 0|16239.4 9.6259768.8 84.3 8.0 0.5 0.1 99.1| 0 0 0 4273 0| 510621 96.72| 09:05:38|81.31 65.33 42.52| 81 7 6 5| 0 0|16349.6 8.7261439.2 81.4 8.2 0.5 0.1 98.9| 0 0 0 4171 0| 510145 96.67| 09:05:39|81.31 65.33 42.52| 82 7 6 5| 0 0|16116.8 8.7257667.6 96.5 8.0 0.5 0.1 99.1| 0 0 0 4348 0| 513093 96.74| 09:05:40|79.60 65.24 42.61| 79 7 7 7| 0 0|16154.2 242.9258390.4 6388.4 8.5 0.5 0.1 99.0| 0 0 0 4033 0| 507244 96.70| 09:05:41|79.60 65.24 42.61| 79 7 8 6| 0 0|16583.1 21.2265129.6 173.5 8.2 0.5 0.1 99.1| 0 0 0 3995 0| 501474 96.57| 09:05:42|79.60 65.24 42.61| 81 8 6 5| 0 0|16281.0 9.7260372.2 69.5 8.3 0.5 0.1 98.7| 0 0 0 4221 0| 509322 96.70| 09:05:43|79.60 65.24 42.61| 80 7 7 6| 0 0|16355.3 8.7261515.5 104.3 8.2 0.5 0.1 99.6| 0 0 0 4087 0| 502052 96.62| -------- -----load-avg---- ---cpu-usage--- ---swap--- -------------------------io-usage----------------------- -QPS- -TPS- -Hit%- time | 1m 5m 15m |usr sys idl iow| si so| r/s w/s rkB/s wkB/s queue await svctm %util| ins upd del sel iud| lor hit| 09:05:44|79.60 65.24 42.61| 83 7 5 4| 0 0|16469.4 11.6263387.0 138.8 8.2 0.5 0.1 98.7| 0 0 0 4292 0| 509979 96.65| 09:05:45|79.07 65.37 42.77| 80 7 6 6| 0 0|16659.5 9.7266478.7 85.0 8.4 0.5 0.1 98.5| 0 0 0 3899 0| 496234 96.54| 09:05:46|79.07 65.37 42.77| 78 7 7 8| 0 0|16752.9 8.7267921.8 97.1 8.4 0.5 0.1 98.9| 0 0 0 4126 0| 508300 96.57| 09:05:47|79.07 65.37 42.77| 82 7 6 5| 0 0|16657.2 9.6266439.3 84.3 8.3 0.5 0.1 98.9| 0 0 0 4086 0| 502171 96.57| 09:05:48|79.07 65.37 42.77| 79 8 6 6| 0 0|16814.5 8.7268924.1 77.6 8.5 0.5 0.1 99.0| 0 0 0 4059 0| 499645 96.52| 09:05:49|79.07 65.37 42.77| 81 7 6 5| 0 0|16553.0 6.8264708.6 42.5 8.3 0.5 0.1 99.4| 0 0 0 4249 0| 501623 96.60| 09:05:50|79.63 65.71 43.01| 79 7 7 7| 0 0|16295.1 246.9260475.0 6442.4 8.7 0.5 0.1 99.1| 0 0 0 4231 0| 511032 96.70| 09:05:51|79.63 65.71 43.01| 80 7 6 6| 0 0|16568.9 8.7264919.7 104.7 8.3 0.5 0.1 99.7| 0 0 0 4272 0| 517177 96.68| 09:05:53|79.63 65.71 43.01| 79 7 7 6| 0 0|16539.0 8.6264502.9 87.6 8.4 0.5 0.1 98.9| 0 0 0 3992 0| 496728 96.52| 09:05:54|79.63 65.71 43.01| 79 7 7 7| 0 0|16527.5 11.6264363.6 92.6 8.5 0.5 0.1 98.8| 0 0 0 4045 0| 502944 96.59| 09:05:55|79.63 65.71 43.01| 80 7 7 6| 0 0|16374.7 12.5261687.2 134.9 8.6 0.5 0.1 99.2| 0 0 0 4143 0| 507006 96.66| 09:05:56|76.05 65.20 42.96| 77 8 8 8| 0 0|16464.9 9.6263314.3 111.9 8.5 0.5 0.1 98.9| 0 0 0 4250 0| 505417 96.64| 09:05:57|76.05 65.20 42.96| 79 7 6 7| 0 0|16460.1 8.8263283.2 93.4 8.3 0.5 0.1 98.8| 0 0 0 4294 0| 508168 96.66| 09:05:58|76.05 65.20 42.96| 80 7 7 7| 0 0|16176.5 9.6258762.1 127.3 8.3 0.5 0.1 98.9| 0 0 0 4160 0| 509349 96.72| 09:05:59|76.05 65.20 42.96| 75 7 9 10| 0 0|16522.0 10.7264274.6 93.1 8.6 0.5 0.1 97.5| 0 0 0 4034 0| 492623 96.51| -------- -----load-avg---- ---cpu-usage--- ---swap--- -------------------------io-usage----------------------- -QPS- -TPS- -Hit%- time | 1m 5m 15m |usr sys idl iow| si so| r/s w/s rkB/s wkB/s queue await svctm %util| ins upd del sel iud| lor hit| 09:06:00|76.05 65.20 42.96| 79 7 7 7| 0 0|16369.6 21.2261867.3 262.5 8.4 0.5 0.1 98.9| 0 0 0 4305 0| 494509 96.59| 09:06:01|75.33 65.23 43.09| 73 6 9 12| 0 0|15864.0 209.3253685.4 6238.0 10.0 0.6 0.1 98.7| 0 0 0 3913 0| 483480 96.62| 09:06:02|75.33 65.23 43.09| 73 7 8 12| 0 0|15854.7 12.7253613.2 93.6 11.0 0.7 0.1 99.0| 0 0 0 4271 0| 483771 96.64| 09:06:03|75.33 65.23 43.09| 75 7 9 9| 0 0|16074.8 8.7257104.3 81.7 8.1 0.5 0.1 98.5| 0 0 0 4060 0| 480701 96.55| 09:06:04|75.33 65.23 43.09| 76 7 8 9| 0 0|16221.7 9.7259500.1 139.4 8.1 0.5 0.1 97.6| 0 0 0 3953 0| 486774 96.56| 09:06:05|74.98 65.33 43.24| 78 7 8 8| 0 0|16330.7 8.7261166.5 85.3 8.2 0.5 0.1 98.5| 0 0 0 3957 0| 481775 96.53| 09:06:06|74.98 65.33 43.24| 75 7 9 9| 0 0|16093.7 11.7257436.1 93.7 8.2 0.5 0.1 99.2| 0 0 0 3938 0| 489251 96.60| 09:06:07|74.98 65.33 43.24| 75 7 5 13| 0 0|15758.9 19.2251989.4 188.2 14.7 0.9 0.1 99.7| 0 0 0 4140 0| 494738 96.70| 09:06:08|74.98 65.33 43.24| 69 7 10 15| 0 0|16166.3 8.7258474.9 81.2 8.9 0.5 0.1 98.7| 0 0 0 3993 0| 487162 96.58| 09:06:09|74.98 65.33 43.24| 74 7 9 10| 0 0|16071.0 8.7257010.9 93.3 8.2 0.5 0.1 99.2| 0 0 0 4098 0| 491557 96.61| 09:06:10|70.98 64.66 43.14| 71 7 9 12| 0 0|15549.6 216.1248701.1 6188.7 8.3 0.5 0.1 97.8| 0 0 0 3879 0| 480832 96.66| 09:06:11|70.98 64.66 43.14| 71 7 10 13| 0 0|16233.7 22.4259568.1 257.1 8.2 0.5 0.1 99.2| 0 0 0 4088 0| 493200 96.62| 09:06:12|70.98 64.66 43.14| 78 7 8 7| 0 0|15932.4 10.6254779.5 108.1 8.1 0.5 0.1 98.6| 0 0 0 4168 0| 489838 96.63| 09:06:13|70.98 64.66 43.14| 71 8 9 12| 0 0|16255.9 11.5259902.3 103.9 8.3 0.5 0.1 98.0| 0 0 0 3874 0| 481246 96.52| 09:06:14|70.98 64.66 43.14| 60 6 16 18| 0 0|15621.0 9.7249826.1 81.9 8.0 0.5 0.1 99.3| 0 0 0 3956 0| 480278 96.65|

Read the article

What is causing the unusual high load average and IOwait?

- by James

I noticed on Tuesday night of last week, the load average went up sharply and it seemed abnormal since the traffic is small. Usually, the numbers usually average around .40 or lower and my server stuff (mysql, php and apache) are optimized. I noticed that the IOWait is unusually high even though the processes is barely using any CPU. top - 01:44:39 up 1 day, 21:13, 1 user, load average: 1.41, 1.09, 0.86 Tasks: 60 total, 1 running, 59 sleeping, 0 stopped, 0 zombie Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 91.5%id, 8.5%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1048576k total, 331944k used, 716632k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 15 0 2468 1376 1140 S 0 0.1 0:00.92 init 1656 root 15 0 13652 5212 664 S 0 0.5 0:00.00 apache2 9323 root 18 0 13652 5212 664 S 0 0.5 0:00.00 apache2 10079 root 18 0 3972 1248 972 S 0 0.1 0:00.00 su 10080 root 15 0 4612 1956 1448 S 0 0.2 0:00.01 bash 11298 root 15 0 13652 5212 664 S 0 0.5 0:00.00 apache2 11778 chikorit 15 0 2344 1092 884 S 0 0.1 0:00.05 top 15384 root 18 0 17544 13m 1568 S 0 1.3 0:02.28 miniserv.pl 15585 root 15 0 8280 2736 2168 S 0 0.3 0:00.02 sshd 15608 chikorit 15 0 8280 1436 860 S 0 0.1 0:00.02 sshd Here is the VMStat procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 768644 0 0 0 0 14 23 0 10 1 0 99 0 IOStat - Nothing unusal Total DISK READ: 67.13 K/s | Total DISK WRITE: 0.00 B/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 19496 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19501 be/4 mysql 3.95 K/s 0.00 B/s 0.00 % 0.00 % mysqld 19568 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19569 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19570 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19571 be/4 chikorit 7.90 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19573 be/4 chikorit 7.90 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init 11778 be/4 chikorit 0.00 B/s 0.00 B/s 0.00 % 0.00 % top 19470 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 0.00 % mysqld Load Average Chart - http://i.stack.imgur.com/kYsD0.png I want to be sure if this is not a MySQL problem before making sure. Also, this is a Ubuntu 10.04 LTS Server on OpenVZ. Edit: This will probably give a good picture on the IO Wait top - 22:12:22 up 17:41, 1 user, load average: 1.10, 1.09, 0.93 Tasks: 33 total, 1 running, 32 sleeping, 0 stopped, 0 zombie Cpu(s): 0.6%us, 0.2%sy, 0.0%ni, 89.0%id, 10.1%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1048576k total, 260708k used, 787868k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 15 0 2468 1376 1140 S 0 0.1 0:00.88 init 5849 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 8063 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 9732 root 16 0 8280 2728 2168 S 0 0.3 0:00.02 sshd 9746 chikorit 18 0 8412 1444 864 S 0 0.1 0:01.10 sshd 9747 chikorit 18 0 4576 1960 1488 S 0 0.2 0:00.24 bash 13706 chikorit 15 0 2344 1088 884 R 0 0.1 0:00.03 top 15745 chikorit 15 0 12968 5108 1280 S 0 0.5 0:00.00 apache2 15751 chikorit 15 0 72184 25m 18m S 0 2.5 0:00.37 php5-fpm 15790 chikorit 18 0 12472 4640 1192 S 0 0.4 0:00.00 apache2 15797 chikorit 15 0 72888 23m 16m S 0 2.3 0:00.06 php5-fpm 16038 root 15 0 67772 2848 592 D 0 0.3 0:00.00 php5-fpm 16309 syslog 18 0 24084 1316 992 S 0 0.1 0:00.07 rsyslogd 16316 root 15 0 5472 908 500 S 0 0.1 0:00.00 sshd 16326 root 15 0 2304 908 712 S 0 0.1 0:00.02 cron 17464 root 15 0 10252 7560 856 D 0 0.7 0:01.88 psad 17466 root 18 0 1684 276 208 S 0 0.0 0:00.31 psadwatchd 17559 root 18 0 11444 2020 732 S 0 0.2 0:00.47 sendmail-mta 17688 root 15 0 10252 5388 1136 S 0 0.5 0:03.81 python 17752 teamspea 19 0 44648 7308 4676 S 0 0.7 1:09.70 ts3server_linux 18098 root 15 0 12336 6380 3032 S 0 0.6 0:00.47 apache2 18099 chikorit 18 0 10368 2536 464 S 0 0.2 0:00.00 apache2 18120 ntp 15 0 4336 1316 984 S 0 0.1 0:00.87 ntpd 18379 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 18387 mysql 15 0 62796 36m 5864 S 0 3.6 1:43.26 mysqld 19584 root 15 0 12336 4028 668 S 0 0.4 0:00.02 apache2 22498 root 16 0 12336 4028 668 S 0 0.4 0:00.00 apache2 24260 root 15 0 67772 3612 1356 S 0 0.3 0:00.22 php5-fpm 27712 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 27730 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 30343 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 30366 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2

Read the article

Determining which database instance makes biggest IO

- by user2008937

Assuming that I have a dedicated server on which I am running multiple instances of mysql and postresql servers. How without iotop determine which instance in particular time (proc/pid/io shows data collected in some peroid of time) makes the biggest IO (so it increases IOWAIT)? When lots of ppl do something on DB then I clearly see which instance is making the load because of high cpu usage, but I had a situation when the cpu usage was just normal, but very high iowait made a huge load on server and i had problem finding process that was making some outstanding IO

Read the article

Mysql performance problem & Failed DIMM

- by murdoch

Hi I have a dedicated mysql database server which has been having some performance problems recently, under normal load the server will be running fine, then suddenly out of the blue the performance will fall off a cliff. The server isn't using the swap file and there is 12GB of RAM in the server, more than enough for its needs. After contacting my hosting comapnies support they have discovered that there is a failed 2GB DIMM in the server and have scheduled to replace it tomorow morning. My question is could a failed DIMM result in the performance problems I am seeing or is this just coincidence? My worry is that they will replace the ram tomorrow but the problems will persist and I will still be lost of explanations so I am just trying to think ahead. The reason I ask is that there is plenty of RAM in the server, more than required and simply missing 2GB should be a problem, so if this failed DIMM is causing these performance problems then the OS must be trying to access the failed DIMM and slowing down as a result. Does that sound like a credible explanation? This is what DELLs omreport program says about the RAM, notice one dimm is "Critical" Memory Information Health : Critical Memory Operating Mode Fail Over State : Inactive Memory Operating Mode Configuration : Optimizer Attributes of Memory Array(s) Attributes : Location Memory Array 1 : System Board or Motherboard Attributes : Use Memory Array 1 : System Memory Attributes : Installed Capacity Memory Array 1 : 12288 MB Attributes : Maximum Capacity Memory Array 1 : 196608 MB Attributes : Slots Available Memory Array 1 : 18 Attributes : Slots Used Memory Array 1 : 6 Attributes : ECC Type Memory Array 1 : Multibit ECC Total of Memory Array(s) Attributes : Total Installed Capacity Value : 12288 MB Attributes : Total Installed Capacity Available to the OS Value : 12004 MB Attributes : Total Maximum Capacity Value : 196608 MB Details of Memory Array 1 Index : 0 Status : Ok Connector Name : DIMM_A1 Type : DDR3-Registered Size : 2048 MB Index : 1 Status : Ok Connector Name : DIMM_A2 Type : DDR3-Registered Size : 2048 MB Index : 2 Status : Ok Connector Name : DIMM_A3 Type : DDR3-Registered Size : 2048 MB Index : 3 Status : Critical Connector Name : DIMM_B1 Type : DDR3-Registered Size : 2048 MB Index : 4 Status : Ok Connector Name : DIMM_B2 Type : DDR3-Registered Size : 2048 MB Index : 5 Status : Ok Connector Name : DIMM_B3 Type : DDR3-Registered Size : 2048 MB the command free -m shows this, the server seems to be using more than 10GB of ram which would suggest it is trying to use the DIMM total used free shared buffers cached Mem: 12004 10766 1238 0 384 4809 -/+ buffers/cache: 5572 6432 Swap: 2047 0 2047 iostat output while problem is occuring avg-cpu: %user %nice %system %iowait %steal %idle 52.82 0.00 11.01 0.00 0.00 36.17 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 47.00 0.00 576.00 0 576 sda1 0.00 0.00 0.00 0 0 sda2 1.00 0.00 32.00 0 32 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 46.00 0.00 544.00 0 544 avg-cpu: %user %nice %system %iowait %steal %idle 53.12 0.00 7.81 0.00 0.00 39.06 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 49.00 0.00 592.00 0 592 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 49.00 0.00 592.00 0 592 avg-cpu: %user %nice %system %iowait %steal %idle 56.09 0.00 7.43 0.37 0.00 36.10 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 232.00 0.00 64520.00 0 64520 sda1 0.00 0.00 0.00 0 0 sda2 159.00 0.00 63728.00 0 63728 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 73.00 0.00 792.00 0 792 avg-cpu: %user %nice %system %iowait %steal %idle 52.18 0.00 9.24 0.06 0.00 38.51 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 49.00 0.00 600.00 0 600 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 49.00 0.00 600.00 0 600 avg-cpu: %user %nice %system %iowait %steal %idle 54.82 0.00 8.64 0.00 0.00 36.55 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 100.00 0.00 2168.00 0 2168 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 100.00 0.00 2168.00 0 2168 avg-cpu: %user %nice %system %iowait %steal %idle 54.78 0.00 6.75 0.00 0.00 38.48 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 84.00 0.00 896.00 0 896 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 84.00 0.00 896.00 0 896 avg-cpu: %user %nice %system %iowait %steal %idle 54.34 0.00 7.31 0.00 0.00 38.35 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 81.00 0.00 840.00 0 840 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 81.00 0.00 840.00 0 840 avg-cpu: %user %nice %system %iowait %steal %idle 55.18 0.00 5.81 0.44 0.00 38.58 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 317.00 0.00 105632.00 0 105632 sda1 0.00 0.00 0.00 0 0 sda2 224.00 0.00 104672.00 0 104672 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 93.00 0.00 960.00 0 960 avg-cpu: %user %nice %system %iowait %steal %idle 55.38 0.00 7.63 0.00 0.00 36.98 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 74.00 0.00 800.00 0 800 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 74.00 0.00 800.00 0 800 avg-cpu: %user %nice %system %iowait %steal %idle 56.43 0.00 7.80 0.00 0.00 35.77 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 72.00 0.00 784.00 0 784 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 72.00 0.00 784.00 0 784 avg-cpu: %user %nice %system %iowait %steal %idle 54.87 0.00 6.49 0.00 0.00 38.64 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 80.20 0.00 855.45 0 864 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 80.20 0.00 855.45 0 864 avg-cpu: %user %nice %system %iowait %steal %idle 57.22 0.00 5.69 0.00 0.00 37.09 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 33.00 0.00 432.00 0 432 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 33.00 0.00 432.00 0 432 avg-cpu: %user %nice %system %iowait %steal %idle 56.03 0.00 7.93 0.00 0.00 36.04 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 41.00 0.00 560.00 0 560 sda1 0.00 0.00 0.00 0 0 sda2 2.00 0.00 88.00 0 88 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 39.00 0.00 472.00 0 472 avg-cpu: %user %nice %system %iowait %steal %idle 55.78 0.00 5.13 0.00 0.00 39.09 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 29.00 0.00 392.00 0 392 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 29.00 0.00 392.00 0 392 avg-cpu: %user %nice %system %iowait %steal %idle 53.68 0.00 8.30 0.06 0.00 37.95 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 78.00 0.00 4280.00 0 4280 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 78.00 0.00 4280.00 0 4280

Read the article

What is causing the unusual high load average?

- by James

I noticed on Tuesday night of last week, the load average went up sharply and it seemed abnormal since the traffic is small. Usually, the numbers usually average around .40 or lower and my server stuff (mysql, php and apache) are optimized. I noticed that the IOWait is unusually high even though the processes is barely using any CPU. top - 01:44:39 up 1 day, 21:13, 1 user, load average: 1.41, 1.09, 0.86 Tasks: 60 total, 1 running, 59 sleeping, 0 stopped, 0 zombie Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 91.5%id, 8.5%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1048576k total, 331944k used, 716632k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 15 0 2468 1376 1140 S 0 0.1 0:00.92 init 1656 root 15 0 13652 5212 664 S 0 0.5 0:00.00 apache2 9323 root 18 0 13652 5212 664 S 0 0.5 0:00.00 apache2 10079 root 18 0 3972 1248 972 S 0 0.1 0:00.00 su 10080 root 15 0 4612 1956 1448 S 0 0.2 0:00.01 bash 11298 root 15 0 13652 5212 664 S 0 0.5 0:00.00 apache2 11778 chikorit 15 0 2344 1092 884 S 0 0.1 0:00.05 top 15384 root 18 0 17544 13m 1568 S 0 1.3 0:02.28 miniserv.pl 15585 root 15 0 8280 2736 2168 S 0 0.3 0:00.02 sshd 15608 chikorit 15 0 8280 1436 860 S 0 0.1 0:00.02 sshd Here is the VMStat procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 768644 0 0 0 0 14 23 0 10 1 0 99 0 IOStat - Nothing unusal Total DISK READ: 67.13 K/s | Total DISK WRITE: 0.00 B/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 19496 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19501 be/4 mysql 3.95 K/s 0.00 B/s 0.00 % 0.00 % mysqld 19568 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19569 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19570 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19571 be/4 chikorit 7.90 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19573 be/4 chikorit 7.90 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init 11778 be/4 chikorit 0.00 B/s 0.00 B/s 0.00 % 0.00 % top 19470 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 0.00 % mysqld Load Average Chart - http://i.stack.imgur.com/kYsD0.png I want to be sure if this is not a MySQL problem before making sure. Also, this is a Ubuntu 10.04 LTS Server on OpenVZ. Edit: This will probably give a good picture on the IO Wait top - 22:12:22 up 17:41, 1 user, load average: 1.10, 1.09, 0.93 Tasks: 33 total, 1 running, 32 sleeping, 0 stopped, 0 zombie Cpu(s): 0.6%us, 0.2%sy, 0.0%ni, 89.0%id, 10.1%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1048576k total, 260708k used, 787868k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 15 0 2468 1376 1140 S 0 0.1 0:00.88 init 5849 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 8063 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 9732 root 16 0 8280 2728 2168 S 0 0.3 0:00.02 sshd 9746 chikorit 18 0 8412 1444 864 S 0 0.1 0:01.10 sshd 9747 chikorit 18 0 4576 1960 1488 S 0 0.2 0:00.24 bash 13706 chikorit 15 0 2344 1088 884 R 0 0.1 0:00.03 top 15745 chikorit 15 0 12968 5108 1280 S 0 0.5 0:00.00 apache2 15751 chikorit 15 0 72184 25m 18m S 0 2.5 0:00.37 php5-fpm 15790 chikorit 18 0 12472 4640 1192 S 0 0.4 0:00.00 apache2 15797 chikorit 15 0 72888 23m 16m S 0 2.3 0:00.06 php5-fpm 16038 root 15 0 67772 2848 592 D 0 0.3 0:00.00 php5-fpm 16309 syslog 18 0 24084 1316 992 S 0 0.1 0:00.07 rsyslogd 16316 root 15 0 5472 908 500 S 0 0.1 0:00.00 sshd 16326 root 15 0 2304 908 712 S 0 0.1 0:00.02 cron 17464 root 15 0 10252 7560 856 D 0 0.7 0:01.88 psad 17466 root 18 0 1684 276 208 S 0 0.0 0:00.31 psadwatchd 17559 root 18 0 11444 2020 732 S 0 0.2 0:00.47 sendmail-mta 17688 root 15 0 10252 5388 1136 S 0 0.5 0:03.81 python 17752 teamspea 19 0 44648 7308 4676 S 0 0.7 1:09.70 ts3server_linux 18098 root 15 0 12336 6380 3032 S 0 0.6 0:00.47 apache2 18099 chikorit 18 0 10368 2536 464 S 0 0.2 0:00.00 apache2 18120 ntp 15 0 4336 1316 984 S 0 0.1 0:00.87 ntpd 18379 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 18387 mysql 15 0 62796 36m 5864 S 0 3.6 1:43.26 mysqld 19584 root 15 0 12336 4028 668 S 0 0.4 0:00.02 apache2 22498 root 16 0 12336 4028 668 S 0 0.4 0:00.00 apache2 24260 root 15 0 67772 3612 1356 S 0 0.3 0:00.22 php5-fpm 27712 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 27730 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 30343 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 30366 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 This is the free ram as of today. total used free shared buffers cached Mem: 1024 302 721 0 0 0 -/+ buffers/cache: 302 721 Swap: 0 0 0 Update: Looking into the logs, particularly the PHP5-FPM, which is causing the CPU spike. I found that its segment faulting for some apparent reason. [03-Jun-2012 06:11:20] NOTICE: [pool www] child 14132 started [03-Jun-2012 06:11:25] WARNING: [pool www] child 13664 exited on signal 11 (SIGSEGV) after 53.686322 seconds from start [03-Jun-2012 06:11:25] NOTICE: [pool www] child 14328 started [03-Jun-2012 06:11:25] WARNING: [pool www] child 14132 exited on signal 11 (SIGSEGV) after 4.708681 seconds from start [03-Jun-2012 06:11:25] NOTICE: [pool www] child 14329 started [03-Jun-2012 06:11:58] WARNING: [pool www] child 14328 exited on signal 11 (SIGSEGV) after 32.981228 seconds from start [03-Jun-2012 06:11:58] NOTICE: [pool www] child 15745 started [03-Jun-2012 06:12:25] WARNING: [pool www] child 15745 exited on signal 11 (SIGSEGV) after 27.442864 seconds from start [03-Jun-2012 06:12:25] NOTICE: [pool www] child 17446 started [03-Jun-2012 06:12:25] WARNING: [pool www] child 14329 exited on signal 11 (SIGSEGV) after 60.411278 seconds from start [03-Jun-2012 06:12:25] NOTICE: [pool www] child 17447 started [03-Jun-2012 06:13:02] WARNING: [pool www] child 17446 exited on signal 11 (SIGSEGV) after 36.746793 seconds from start [03-Jun-2012 06:13:02] NOTICE: [pool www] child 18133 started [03-Jun-2012 06:13:48] WARNING: [pool www] child 17447 exited on signal 11 (SIGSEGV) after 82.710107 seconds from start I'm thinking that this might be causing the problem. If that is the cause, probably switching it off that to fastcgi/fcgid might resolve it... but still, I want to see if something else might be causing it to do this.

Read the article

Strange Recurrent Excessive I/O Wait

- by Chris

I know quite well that I/O wait has been discussed multiple times on this site, but all the other topics seem to cover constant I/O latency, while the I/O problem we need to solve on our server occurs at irregular (short) intervals, but is ever-present with massive spikes of up to 20k ms a-wait and service times of 2 seconds. The disk affected is /dev/sdb (Seagate Barracuda, for details see below). A typical iostat -x output would at times look like this, which is an extreme sample but by no means rare: iostat (Oct 6, 2013) tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16.00 0.00 156.00 9.75 21.89 288.12 36.00 57.60 5.50 0.00 44.00 8.00 48.79 2194.18 181.82 100.00 2.00 0.00 16.00 8.00 46.49 3397.00 500.00 100.00 4.50 0.00 40.00 8.89 43.73 5581.78 222.22 100.00 14.50 0.00 148.00 10.21 13.76 5909.24 68.97 100.00 1.50 0.00 12.00 8.00 8.57 7150.67 666.67 100.00 0.50 0.00 4.00 8.00 6.31 10168.00 2000.00 100.00 2.00 0.00 16.00 8.00 5.27 11001.00 500.00 100.00 0.50 0.00 4.00 8.00 2.96 17080.00 2000.00 100.00 34.00 0.00 1324.00 9.88 1.32 137.84 4.45 59.60 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 22.00 44.00 204.00 11.27 0.01 0.27 0.27 0.60 Let me provide you with some more information regarding the hardware. It's a Dell 1950 III box with Debian as OS where uname -a reports the following: Linux xx 2.6.32-5-amd64 #1 SMP Fri Feb 15 15:39:52 UTC 2013 x86_64 GNU/Linux The machine is a dedicated server that hosts an online game without any databases or I/O heavy applications running. The core application consumes about 0.8 of the 8 GBytes RAM, and the average CPU load is relatively low. The game itself, however, reacts rather sensitive towards I/O latency and thus our players experience massive ingame lag, which we would like to address as soon as possible. iostat: avg-cpu: %user %nice %system %iowait %steal %idle 1.77 0.01 1.05 1.59 0.00 95.58 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sdb 13.16 25.42 135.12 504701011 2682640656 sda 1.52 0.74 20.63 14644533 409684488 Uptime is: 19:26:26 up 229 days, 17:26, 4 users, load average: 0.36, 0.37, 0.32 Harddisk controller: 01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04) Harddisks: Array 1, RAID-1, 2x Seagate Cheetah 15K.5 73 GB SAS Array 2, RAID-1, 2x Seagate ST3500620SS Barracuda ES.2 500GB 16MB 7200RPM SAS Partition information from df: Filesystem 1K-blocks Used Available Use% Mounted on /dev/sdb1 480191156 30715200 425083668 7% /home /dev/sda2 7692908 437436 6864692 6% / /dev/sda5 15377820 1398916 13197748 10% /usr /dev/sda6 39159724 19158340 18012140 52% /var Some more data samples generated with iostat -dx sdb 1 (Oct 11, 2013) Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 15.00 0.00 70.00 0.00 656.00 9.37 4.50 1.83 4.80 33.60 sdb 0.00 0.00 0.00 2.00 0.00 16.00 8.00 12.00 836.00 500.00 100.00 sdb 0.00 0.00 0.00 3.00 0.00 32.00 10.67 9.96 1990.67 333.33 100.00 sdb 0.00 0.00 0.00 4.00 0.00 40.00 10.00 6.96 3075.00 250.00 100.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 100.00 sdb 0.00 0.00 0.00 2.00 0.00 16.00 8.00 2.62 4648.00 500.00 100.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 100.00 sdb 0.00 0.00 0.00 1.00 0.00 16.00 16.00 1.69 7024.00 1000.00 100.00 sdb 0.00 74.00 0.00 124.00 0.00 1584.00 12.77 1.09 67.94 6.94 86.00 Characteristic charts generated with rrdtool can be found here: iostat plot 1, 24 min interval: http://imageshack.us/photo/my-images/600/yqm3.png/ iostat plot 2, 120 min interval: http://imageshack.us/photo/my-images/407/griw.png/ As we have a rather large cache of 5.5 GBytes, we thought it might be a good idea to test if the I/O wait spikes would perhaps be caused by cache miss events. Therefore, we did a sync and then this to flush the cache and buffers: echo 3 > /proc/sys/vm/drop_caches and directly afterwards the I/O wait and service times virtually went through the roof, and everything on the machine felt like slow motion. During the next few hours the latency recovered and everything was as before - small to medium lags in short, unpredictable intervals. Now my question is: does anybody have any idea what might cause this annoying behaviour? Is it the first indication of the disk array or the raid controller dying, or something that can be easily mended by rebooting? (At the moment we're very reluctant to do this, however, because we're afraid that the disks might not come back up again.) Any help is greatly appreciated. Thanks in advance, Chris. Edited to add: we do see one or two processes go to 'D' state in top, one of which seems to be kjournald rather frequently. If I'm not mistaken, however, this does not indicate the processes causing the latency, but rather those affected by it - correct me if I'm wrong. Does the information about uninterruptibly sleeping processes help us in any way to address the problem? @Andy Shinn requested smartctl data, here it is: smartctl -a -d megaraid,2 /dev/sdb yields: smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net Device: SEAGATE ST3500620SS Version: MS05 Serial number: Device type: disk Transport protocol: SAS Local Time is: Mon Oct 14 20:37:13 2013 CEST Device supports SMART and is Enabled Temperature Warning Disabled or Not Supported SMART Health Status: OK Current Drive Temperature: 20 C Drive Trip Temperature: 68 C Elements in grown defect list: 0 Vendor (Seagate) cache information Blocks sent to initiator = 1236631092 Blocks received from initiator = 1097862364 Blocks read from cache and sent to initiator = 1383620256 Number of read and write commands whose size <= segment size = 531295338 Number of read and write commands whose size > segment size = 51986460 Vendor (Seagate/Hitachi) factory information number of hours powered up = 36556.93 number of minutes until next internal SMART test = 32 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [10^9 bytes] errors read: 509271032 47 0 509271079 509271079 20981.423 0 write: 0 0 0 0 0 5022.039 0 verify: 1870931090 196 0 1870931286 1870931286 100558.708 0 Non-medium error count: 0 SMART Self-test log Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] Description number (hours) # 1 Background short Completed 16 36538 - [- - -] # 2 Background short Completed 16 36514 - [- - -] # 3 Background short Completed 16 36490 - [- - -] # 4 Background short Completed 16 36466 - [- - -] # 5 Background short Completed 16 36442 - [- - -] # 6 Background long Completed 16 36420 - [- - -] # 7 Background short Completed 16 36394 - [- - -] # 8 Background short Completed 16 36370 - [- - -] # 9 Background long Completed 16 36364 - [- - -] #10 Background short Completed 16 36361 - [- - -] #11 Background long Completed 16 2 - [- - -] #12 Background short Completed 16 0 - [- - -] Long (extended) Self Test duration: 6798 seconds [113.3 minutes] smartctl -a -d megaraid,3 /dev/sdb yields: smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net Device: SEAGATE ST3500620SS Version: MS05 Serial number: Device type: disk Transport protocol: SAS Local Time is: Mon Oct 14 20:37:26 2013 CEST Device supports SMART and is Enabled Temperature Warning Disabled or Not Supported SMART Health Status: OK Current Drive Temperature: 19 C Drive Trip Temperature: 68 C Elements in grown defect list: 0 Vendor (Seagate) cache information Blocks sent to initiator = 288745640 Blocks received from initiator = 1097848399 Blocks read from cache and sent to initiator = 1304149705 Number of read and write commands whose size <= segment size = 527414694 Number of read and write commands whose size > segment size = 51986460 Vendor (Seagate/Hitachi) factory information number of hours powered up = 36596.83 number of minutes until next internal SMART test = 28 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [10^9 bytes] errors read: 610862490 44 0 610862534 610862534 20470.133 0 write: 0 0 0 0 0 5022.480 0 verify: 2861227413 203 0 2861227616 2861227616 100872.443 0 Non-medium error count: 1 SMART Self-test log Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ] Description number (hours) # 1 Background short Completed 16 36580 - [- - -] # 2 Background short Completed 16 36556 - [- - -] # 3 Background short Completed 16 36532 - [- - -] # 4 Background short Completed 16 36508 - [- - -] # 5 Background short Completed 16 36484 - [- - -] # 6 Background long Completed 16 36462 - [- - -] # 7 Background short Completed 16 36436 - [- - -] # 8 Background short Completed 16 36412 - [- - -] # 9 Background long Completed 16 36404 - [- - -] #10 Background short Completed 16 36401 - [- - -] #11 Background long Completed 16 2 - [- - -] #12 Background short Completed 16 0 - [- - -] Long (extended) Self Test duration: 6798 seconds [113.3 minutes]

Read the article

high load average, high wait, dmesg raid error messages (debian nfs server)

- by John Stumbles

Debian 6 on HP proliant (2 CPU) with raid (2*1.5T RAID1 + 2*2T RAID1 joined RAID0 to make 3.5T) running mainly nfs & imapd (plus samba for windows share & local www for previewing web pages); with local ubuntu desktop client mounting $HOME, laptops accessing imap & odd files (e.g. videos) via nfs/smb; boxes connected 100baseT or wifi via home router/switch uname -a Linux prole 2.6.32-5-686 #1 SMP Wed Jan 11 12:29:30 UTC 2012 i686 GNU/Linux Setup has been working for months but prone to intermittently going very slow (user experience on desktop mounting $HOME from server, or laptop playing videos) and now consistently so bad I've had to delve into it to try to find what's wrong(!) Server seems OK at low load e.g. (laptop) client (with $HOME on local disk) connecting to server's imapd and nfs mounting RAID to access 1 file: top shows load ~ 0.1 or less, 0 wait but when (desktop) client mounts $HOME and starts user KDE session (all accessing server) then top shows e.g. top - 13:41:17 up 3:43, 3 users, load average: 9.29, 9.55, 8.27 Tasks: 158 total, 1 running, 157 sleeping, 0 stopped, 0 zombie Cpu(s): 0.4%us, 0.4%sy, 0.0%ni, 49.0%id, 49.7%wa, 0.0%hi, 0.5%si, 0.0%st Mem: 903856k total, 851784k used, 52072k free, 171152k buffers Swap: 0k total, 0k used, 0k free, 476896k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3935 root 20 0 2456 1088 784 R 2 0.1 0:00.02 top 1 root 20 0 2028 680 584 S 0 0.1 0:01.14 init 2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd 3 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0 4 root 20 0 0 0 0 S 0 0.0 0:00.12 ksoftirqd/0 5 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0 6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1 7 root 20 0 0 0 0 S 0 0.0 0:00.16 ksoftirqd/1 8 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/1 9 root 20 0 0 0 0 S 0 0.0 0:00.42 events/0 10 root 20 0 0 0 0 S 0 0.0 0:02.26 events/1 11 root 20 0 0 0 0 S 0 0.0 0:00.00 cpuset 12 root 20 0 0 0 0 S 0 0.0 0:00.00 khelper 13 root 20 0 0 0 0 S 0 0.0 0:00.00 netns 14 root 20 0 0 0 0 S 0 0.0 0:00.00 async/mgr 15 root 20 0 0 0 0 S 0 0.0 0:00.00 pm 16 root 20 0 0 0 0 S 0 0.0 0:00.02 sync_supers 17 root 20 0 0 0 0 S 0 0.0 0:00.02 bdi-default 18 root 20 0 0 0 0 S 0 0.0 0:00.00 kintegrityd/0 19 root 20 0 0 0 0 S 0 0.0 0:00.00 kintegrityd/1 20 root 20 0 0 0 0 S 0 0.0 0:00.02 kblockd/0 21 root 20 0 0 0 0 S 0 0.0 0:00.08 kblockd/1 22 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpid 23 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpi_notify 24 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpi_hotplug 25 root 20 0 0 0 0 S 0 0.0 0:00.00 kseriod 28 root 20 0 0 0 0 S 0 0.0 0:04.19 kondemand/0 29 root 20 0 0 0 0 S 0 0.0 0:02.93 kondemand/1 30 root 20 0 0 0 0 S 0 0.0 0:00.00 khungtaskd 31 root 20 0 0 0 0 S 0 0.0 0:00.18 kswapd0 32 root 25 5 0 0 0 S 0 0.0 0:00.00 ksmd 33 root 20 0 0 0 0 S 0 0.0 0:00.00 aio/0 34 root 20 0 0 0 0 S 0 0.0 0:00.00 aio/1 35 root 20 0 0 0 0 S 0 0.0 0:00.00 crypto/0 36 root 20 0 0 0 0 S 0 0.0 0:00.00 crypto/1 203 root 20 0 0 0 0 S 0 0.0 0:00.00 ksuspend_usbd 204 root 20 0 0 0 0 S 0 0.0 0:00.00 khubd 205 root 20 0 0 0 0 S 0 0.0 0:00.00 ata/0 206 root 20 0 0 0 0 S 0 0.0 0:00.00 ata/1 207 root 20 0 0 0 0 S 0 0.0 0:00.14 ata_aux 208 root 20 0 0 0 0 S 0 0.0 0:00.01 scsi_eh_0 dmesg suggests there's a disk problem: .............. (previous episode) [13276.966004] raid1:md0: read error corrected (8 sectors at 489900360 on sdc7) [13276.966043] raid1: sdb7: redirecting sector 489898312 to another mirror [13279.569186] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13279.569211] ata4.00: irq_stat 0x40000008 [13279.569230] ata4.00: failed command: READ FPDMA QUEUED [13279.569257] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13279.569262] res 41/40:00:05:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13279.569306] ata4.00: status: { DRDY ERR } [13279.569321] ata4.00: error: { UNC } [13279.575362] ata4.00: configured for UDMA/133 [13279.575388] ata4: EH complete [13283.169224] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13283.169246] ata4.00: irq_stat 0x40000008 [13283.169263] ata4.00: failed command: READ FPDMA QUEUED [13283.169289] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13283.169294] res 41/40:00:07:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13283.169331] ata4.00: status: { DRDY ERR } [13283.169345] ata4.00: error: { UNC } [13283.176071] ata4.00: configured for UDMA/133 [13283.176104] ata4: EH complete [13286.224814] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13286.224837] ata4.00: irq_stat 0x40000008 [13286.224853] ata4.00: failed command: READ FPDMA QUEUED [13286.224879] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13286.224884] res 41/40:00:06:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13286.224922] ata4.00: status: { DRDY ERR } [13286.224935] ata4.00: error: { UNC } [13286.231277] ata4.00: configured for UDMA/133 [13286.231303] ata4: EH complete [13288.802623] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13288.802646] ata4.00: irq_stat 0x40000008 [13288.802662] ata4.00: failed command: READ FPDMA QUEUED [13288.802688] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13288.802693] res 41/40:00:05:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13288.802731] ata4.00: status: { DRDY ERR } [13288.802745] ata4.00: error: { UNC } [13288.808901] ata4.00: configured for UDMA/133 [13288.808927] ata4: EH complete [13291.380430] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13291.380453] ata4.00: irq_stat 0x40000008 [13291.380470] ata4.00: failed command: READ FPDMA QUEUED [13291.380496] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13291.380501] res 41/40:00:05:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13291.380577] ata4.00: status: { DRDY ERR } [13291.380594] ata4.00: error: { UNC } [13291.386517] ata4.00: configured for UDMA/133 [13291.386543] ata4: EH complete [13294.347147] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13294.347169] ata4.00: irq_stat 0x40000008 [13294.347186] ata4.00: failed command: READ FPDMA QUEUED [13294.347211] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13294.347217] res 41/40:00:06:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13294.347254] ata4.00: status: { DRDY ERR } [13294.347268] ata4.00: error: { UNC } [13294.353556] ata4.00: configured for UDMA/133 [13294.353583] sd 3:0:0:0: [sdc] Unhandled sense code [13294.353590] sd 3:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [13294.353599] sd 3:0:0:0: [sdc] Sense Key : Medium Error [current] [descriptor] [13294.353610] Descriptor sense data with sense descriptors (in hex): [13294.353616] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 [13294.353635] 23 05 6a 06 [13294.353644] sd 3:0:0:0: [sdc] Add. Sense: Unrecovered read error - auto reallocate failed [13294.353657] sd 3:0:0:0: [sdc] CDB: Read(10): 28 00 23 05 6a 00 00 00 08 00 [13294.353675] end_request: I/O error, dev sdc, sector 587557382 [13294.353726] ata4: EH complete [13294.366953] raid1:md0: read error corrected (8 sectors at 489900544 on sdc7) [13294.366992] raid1: sdc7: redirecting sector 489898496 to another mirror and they're happening quite frequently, which I guess is liable to account for the performance problem(?) # dmesg | grep mirror [12433.561822] raid1: sdc7: redirecting sector 489900464 to another mirror [12449.428933] raid1: sdb7: redirecting sector 489900504 to another mirror [12464.807016] raid1: sdb7: redirecting sector 489900512 to another mirror [12480.196222] raid1: sdb7: redirecting sector 489900520 to another mirror [12495.585413] raid1: sdb7: redirecting sector 489900528 to another mirror [12510.974424] raid1: sdb7: redirecting sector 489900536 to another mirror [12526.374933] raid1: sdb7: redirecting sector 489900544 to another mirror [12542.619938] raid1: sdc7: redirecting sector 489900608 to another mirror [12559.431328] raid1: sdc7: redirecting sector 489900616 to another mirror [12576.553866] raid1: sdc7: redirecting sector 489900624 to another mirror [12592.065265] raid1: sdc7: redirecting sector 489900632 to another mirror [12607.621121] raid1: sdc7: redirecting sector 489900640 to another mirror [12623.165856] raid1: sdc7: redirecting sector 489900648 to another mirror [12638.699474] raid1: sdc7: redirecting sector 489900656 to another mirror [12655.610881] raid1: sdc7: redirecting sector 489900664 to another mirror [12672.255617] raid1: sdc7: redirecting sector 489900672 to another mirror [12672.288746] raid1: sdc7: redirecting sector 489900680 to another mirror [12672.332376] raid1: sdc7: redirecting sector 489900688 to another mirror [12672.362935] raid1: sdc7: redirecting sector 489900696 to another mirror [12674.201177] raid1: sdc7: redirecting sector 489900704 to another mirror [12698.045050] raid1: sdc7: redirecting sector 489900712 to another mirror [12698.089309] raid1: sdc7: redirecting sector 489900720 to another mirror [12698.111999] raid1: sdc7: redirecting sector 489900728 to another mirror [12698.134006] raid1: sdc7: redirecting sector 489900736 to another mirror [12719.034376] raid1: sdc7: redirecting sector 489900744 to another mirror [12734.545775] raid1: sdc7: redirecting sector 489900752 to another mirror [12734.590014] raid1: sdc7: redirecting sector 489900760 to another mirror [12734.624050] raid1: sdc7: redirecting sector 489900768 to another mirror [12734.647308] raid1: sdc7: redirecting sector 489900776 to another mirror [12734.664657] raid1: sdc7: redirecting sector 489900784 to another mirror [12734.710642] raid1: sdc7: redirecting sector 489900792 to another mirror [12734.721919] raid1: sdc7: redirecting sector 489900800 to another mirror [12734.744732] raid1: sdc7: redirecting sector 489900808 to another mirror [12734.779330] raid1: sdc7: redirecting sector 489900816 to another mirror [12782.604564] raid1: sdb7: redirecting sector 1242934216 to another mirror [12798.264153] raid1: sdc7: redirecting sector 1242935080 to another mirror [13245.832193] raid1: sdb7: redirecting sector 489898296 to another mirror [13261.376929] raid1: sdb7: redirecting sector 489898304 to another mirror [13276.966043] raid1: sdb7: redirecting sector 489898312 to another mirror [13294.366992] raid1: sdc7: redirecting sector 489898496 to another mirror although the arrays are still running on all disks - they haven't given up on any yet: # cat /proc/mdstat Personalities : [raid1] [raid0] md10 : active raid0 md0[0] md1[1] 3368770048 blocks super 1.2 512k chunks md1 : active raid1 sde2[2] sdd2[1] 1464087824 blocks super 1.2 [2/2] [UU] md0 : active raid1 sdb7[0] sdc7[2] 1904684920 blocks super 1.2 [2/2] [UU] unused devices: <none> So I think I have some idea what the problem is but I am not a linux sysadmin expert by the remotest stretch of the imagination and would really appreciate some clue checking here with my diagnosis and what do I need to do: obviously I need to source another drive for sdc. (I'm guessing I could buy a larger drive if the price is right: I'm thinking that one day I'll need to grow the size of the array and that would be one less drive to replace with a larger one) then use mdadm to fail out the existing sdc, remove it and fit the new drive fdisk the new drive with the same size partition for the array as the old one had use mdadm to add the new drive into the array that sound OK?

Read the article

aligning truecrypt partition on 1.5TB 4kB sector drive

- by pQd

hi, aligning partitions to start at real physical sector of ssds / stripped raids / 4kB drives is a 'good thing to do'. but i've run into a problems when trying to do it for a truecrypt partition that will contain ext3 on it. or so it seems. when drive is question is partitioned properly and formatted with ext3 i get very reasonable write speeds around 70-80MB/s, but when i put truecrypt and ext3 on the top of it write performance becomes very unstable and goes between 1-25MB/s with very high io-wait. on the same server i dont have any performance issues with ext3 on the top of truecrypt on regular 512B-sector 500GB sata disks. so my best guess is that iowaits are caused by misalignment but i cannot really find reliable information on how to calculate optimal partition beginning. i've tried to start it at 128 logical sector, i've also tried 8132 sector as suggested here but both gave me very bad and unstable performance. do you have any experience with similar setup? thanks!

Read the article

How do I know if my disks are being hit with too many I/O reads or writes or both?

- by Mark F

I know a bit about disk I/O and bottlenecks relating to this especially when relating to databases. How do I really know what the max I/O numbers will be for my disks? What metric might be available to me for working out roughly (but needs to be a good approximation) of how much capacity (if you will) have I got left available in I/O. I've seen it before where things are bubbling along nicely and then all of a sudden, everything screams to a halt, and it ends up being an I/O bound problem. Is there a better way to predict when I/O is reaching its limits? This article was interesting but not giving the answer I desire. So, is my best bet surrounding just looking at 'CPU I/O WAIT'? There must be a more reactive method than this.

Read the article

How do I know if my disks are being hit with too much IO reads or writes or both?

- by Mark F

Hi All, So I know a bit about disk I/O and bottlenecks relating to this especially when relating to databases. But how do I really know what the max IO numbers will be for my disks? What metric might be available to me for working out roughly (but needs to be a good approximation) of how much capacity (if you will) have I got left available in I/O. I've seen it before where things are bubbling along nicely and then all of a sudden, everything screams to a halt, and it ends up being an IO bound problem. Is there a better way to predict when IO is reaching its limits? This article was interesting but not giving the answer I desire. "http://serverfault.com/questions/61510/linux-how-can-i-see-whats-waiting-for-disk-io". So is my best bet surrounding just looking at 'CPU IO WAIT'? There must be a more reactive method for this? Best, M

Read the article

MySQL high IO usage quries

- by jack

MySQL has a built-in slow query logger. Is there any options or third-party tools which are able to detect the queries causing high IO usage just in the way like what slow query logger does?

Read the article

High Server Load cannot figure out why

- by Tim Bolton

My server is currently running CentOS 5.2, with WHM 11.34. Currently, we're at 6.43 to 12 for a load average. The sites that we're hosting are taking a lot time to respond and resolve. top doesn't show anything out of the ordinary and iftop doesn't show a lot of traffic. We have many resellers, and some not so good at writing code, how can we find the culprit? vmstat output: vmstat procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 0 2 84 78684 154916 1021080 0 0 72 274 0 14 6 3 80 12 0 top output (ordered by %CPU) top - 21:44:43 up 5 days, 10:39, 3 users, load average: 3.36, 4.18, 4.73 Tasks: 222 total, 3 running, 219 sleeping, 0 stopped, 0 zombie Cpu(s): 5.8%us, 2.3%sy, 0.2%ni, 79.6%id, 11.8%wa, 0.0%hi, 0.2%si, 0.0%st Mem: 2074580k total, 1863044k used, 211536k free, 174828k buffers Swap: 2040212k total, 84k used, 2040128k free, 987604k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 15930 mysql 15 0 138m 46m 4380 S 4 2.3 1:45.87 mysqld 21772 igniteth 17 0 23200 7152 3932 R 4 0.3 0:00.02 php 1586 root 10 -5 0 0 0 S 2 0.0 11:45.19 kjournald 21759 root 15 0 2416 1024 732 R 2 0.0 0:00.01 top 1 root 15 0 2156 648 560 S 0 0.0 0:26.31 init 2 root RT 0 0 0 0 S 0 0.0 0:00.35 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.32 ksoftirqd/0 4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0 5 root RT 0 0 0 0 S 0 0.0 0:02.00 migration/1 6 root 34 19 0 0 0 S 0 0.0 0:00.11 ksoftirqd/1 7 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/1 8 root RT 0 0 0 0 S 0 0.0 0:01.29 migration/2 9 root 34 19 0 0 0 S 0 0.0 0:00.26 ksoftirqd/2 10 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/2 11 root RT 0 0 0 0 S 0 0.0 0:00.90 migration/3 12 root 34 19 0 0 0 R 0 0.0 0:00.20 ksoftirqd/3 13 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/3 top output (ordered by CPU time) top - 21:46:12 up 5 days, 10:41, 3 users, load average: 2.88, 3.82, 4.55 Tasks: 217 total, 1 running, 216 sleeping, 0 stopped, 0 zombie Cpu(s): 3.7%us, 2.0%sy, 2.0%ni, 67.2%id, 25.0%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 2074580k total, 1959516k used, 115064k free, 183116k buffers Swap: 2040212k total, 84k used, 2040128k free, 1090308k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ TIME COMMAND 32367 root 16 0 215m 212m 1548 S 0 10.5 62:03.63 62:03 tailwatchd 1586 root 10 -5 0 0 0 S 0 0.0 11:45.27 11:45 kjournald 1576 root 10 -5 0 0 0 S 0 0.0 2:37.86 2:37 kjournald 27722 root 16 0 2556 1184 800 S 0 0.1 1:48.94 1:48 top 15930 mysql 15 0 138m 46m 4380 S 4 2.3 1:48.63 1:48 mysqld 2932 root 34 19 0 0 0 S 0 0.0 1:41.05 1:41 kipmi0 226 root 10 -5 0 0 0 S 0 0.0 1:34.33 1:34 kswapd0 2671 named 25 0 74688 7400 2116 S 0 0.4 1:23.58 1:23 named 3229 root 15 0 10300 3348 2724 S 0 0.2 0:40.85 0:40 sshd 1580 root 10 -5 0 0 0 S 0 0.0 0:30.62 0:30 kjournald 1 root 17 0 2156 648 560 S 0 0.0 0:26.32 0:26 init 2616 root 15 0 1816 576 480 S 0 0.0 0:23.50 0:23 syslogd 1584 root 10 -5 0 0 0 S 0 0.0 0:18.67 0:18 kjournald 4342 root 34 19 27692 11m 2116 S 0 0.5 0:18.23 0:18 yum-updatesd 8044 bollingp 15 0 3456 2036 740 S 1 0.1 0:15.56 0:15 imapd 26 root 10 -5 0 0 0 S 0 0.0 0:14.18 0:14 kblockd/1 7989 gmailsit 16 0 3196 1748 736 S 0 0.1 0:10.43 0:10 imapd iostat -xtk 1 10 output [root@server1 tmp]# iostat -xtk 1 10 Linux 2.6.18-53.el5 12/18/2012 Time: 09:51:06 PM avg-cpu: %user %nice %system %iowait %steal %idle 5.83 0.19 2.53 11.85 0.00 79.60 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 1.37 118.83 18.70 54.27 131.47 692.72 22.59 4.90 67.19 3.10 22.59 sdb 0.35 39.33 20.33 61.43 158.79 403.22 13.75 5.23 63.93 3.77 30.80 Time: 09:51:07 PM avg-cpu: %user %nice %system %iowait %steal %idle 1.50 0.00 0.50 24.00 0.00 74.00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 25.00 2.00 2.00 128.00 108.00 118.00 0.03 7.25 4.00 1.60 sdb 0.00 16.00 41.00 145.00 200.00 668.00 9.33 107.92 272.72 5.38 100.10 Time: 09:51:08 PM avg-cpu: %user %nice %system %iowait %steal %idle 2.00 0.00 1.50 29.50 0.00 67.00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 95.00 3.00 33.00 12.00 480.00 27.33 0.07 1.72 1.31 4.70 sdb 0.00 14.00 1.00 228.00 4.00 960.00 8.42 143.49 568.01 4.37 100.10 Time: 09:51:09 PM avg-cpu: %user %nice %system %iowait %steal %idle 13.28 0.00 2.76 21.30 0.00 62.66 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 21.00 1.00 19.00 16.00 192.00 20.80 0.06 3.55 1.30 2.60 sdb 0.00 36.00 28.00 181.00 124.00 884.00 9.65 121.16 617.31 4.79 100.10 Time: 09:51:10 PM avg-cpu: %user %nice %system %iowait %steal %idle 4.74 0.00 1.50 25.19 0.00 68.58 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 20.00 3.00 15.00 12.00 136.00 16.44 0.17 7.11 3.11 5.60 sdb 0.00 0.00 103.00 60.00 544.00 248.00 9.72 52.35 545.23 6.14 100.10 Time: 09:51:11 PM avg-cpu: %user %nice %system %iowait %steal %idle 1.24 0.00 1.24 25.31 0.00 72.21 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 75.00 4.00 28.00 16.00 416.00 27.00 0.08 3.72 2.03 6.50 sdb 2.00 9.00 124.00 17.00 616.00 104.00 10.21 3.73 213.73 7.10 100.10 Time: 09:51:12 PM avg-cpu: %user %nice %system %iowait %steal %idle 1.00 0.00 0.75 24.31 0.00 73.93 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 24.00 1.00 9.00 4.00 132.00 27.20 0.01 1.20 1.10 1.10 sdb 4.00 40.00 103.00 48.00 528.00 212.00 9.80 105.21 104.32 6.64 100.20 Time: 09:51:13 PM avg-cpu: %user %nice %system %iowait %steal %idle 2.50 0.00 1.75 23.25 0.00 72.50 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 125.74 3.96 46.53 15.84 689.11 27.92 0.20 4.06 2.41 12.18 sdb 2.97 0.00 91.09 84.16 419.80 471.29 10.17 85.85 590.78 5.66 99.11 Time: 09:51:14 PM avg-cpu: %user %nice %system %iowait %steal %idle 0.75 0.00 0.50 24.94 0.00 73.82 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 88.00 1.00 7.00 4.00 380.00 96.00 0.04 4.38 3.00 2.40 sdb 3.00 7.00 111.00 44.00 540.00 208.00 9.65 18.58 581.79 6.46 100.10 Time: 09:51:15 PM avg-cpu: %user %nice %system %iowait %steal %idle 11.03 0.00 3.26 26.57 0.00 59.15 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 145.00 7.00 53.00 28.00 792.00 27.33 0.15 2.50 1.55 9.30 sdb 1.00 0.00 155.00 0.00 800.00 0.00 10.32 2.85 18.63 6.46 100.10 [root@server1 tmp]# MySQL Show Full Processlist mysql> show full processlist; +------+---------------+-----------+-----------------------+----------------+------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +------+---------------+-----------+-----------------------+----------------+------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 1 | DB_USER_ONE | localhost | DB_ONE | Query | 3 | waiting for handler insert | INSERT DELAYED INTO defers (mailtime,msgid,email,transport_method,message,host,ip,router,deliveryuser,deliverydomain) VALUES(FROM_UNIXTIME('1355879748'),'1TivwL-0003y8-8l','[email protected]','remote_smtp','SMTP error from remote mail server after initial connection: host mx1.mail.tw.yahoo.com [203.188.197.119]: 421 4.7.0 [TS01] Messages from 75.125.90.146 temporarily deferred due to user complaints - 4.16.55.1; see http://postmaster.yahoo.com/421-ts01.html','mx1.mail.tw.yahoo.com','203.188.197.119','lookuphost','','') | | 2 | DELAYED | localhost | DB_ONE | Delayed insert | 52 | insert | | | 3 | DELAYED | localhost | DB_ONE | Delayed insert | 68 | insert | | | 911 | DELAYED | localhost | DB_ONE | Delayed insert | 99 | Waiting for INSERT | | | 993 | DB_USER_TWO | localhost | DB_TWO | Sleep | 832 | | NULL | | 994 | DB_USER_ONE | localhost | DB_ONE | Query | 185 | Locked | delete from failures where FROM_UNIXTIME(UNIX_TIMESTAMP(NOW())-1296000) > mailtime | | 1102 | DB_USER_THREE | localhost | DB_THREE | Query | 29 | NULL | commit | | 1249 | DB_USER_FOUR | localhost | DB_FOUR | Query | 13 | NULL | commit | | 1263 | root | localhost | DB_FIVE | Query | 0 | NULL | show full processlist | | 1264 | DB_USER_SIX | localhost | DB_SIX | Query | 3 | NULL | commit | +------+---------------+-----------+-----------------------+----------------+------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 10 rows in set (0.00 sec)

Read the article

baffling stat pauses - (reiserfs)

- by Twirrim

I've got a SuSE server with reiserfs formatted partitions, all on a RAID1 mirror. I'm noticing odd spikes in iowait, but generally all seems okay. iostat claims not even 3% iowait in general: avg-cpu: %user %nice %system %iowait %steal %idle 6.53 0.03 1.45 2.92 0.00 89.07 After tracking down some odd behaviour when ls'ing, it appears "stat" (ls is aliased to include color which does a stat process on each file) randomly takes 5 seconds on a file in the directory. ReiserFS does a file system sync every 5 seconds, but that's interval between not length of time taken to sync. Out of curiosity I did remount using noatime to see if that would help though as it would reduce sync's workload, but no joy. Anyone got any thoughts what might be causing this pause? Disks appear healthy, RAID controller believes the RAID is healthy, and io stats show the disks aren't working very hard at all.

Read the article

Why does mpstat show different values when I use the interval setting?

- by Abe

Here's the output I get when I run mpstat: $mpstat Linux 3.2.0-30-generic (my-laptop-C650) 09/17/2012 _x86_64_ (2 CPU) 05:32:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 05:32:01 PM all 9.16 0.08 2.69 2.00 0.00 0.04 0.00 0.00 86.02 And here's what I get when I run it with a one-second interval: $mpstat 1 05:31:51 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 05:31:52 PM all 1.52 0.00 1.01 0.00 0.00 0.00 0.00 0.00 97.47 05:31:53 PM all 2.04 0.00 1.02 0.00 0.00 0.00 0.00 0.00 96.94 05:31:54 PM all 1.50 0.00 1.50 0.00 0.00 0.00 0.00 0.00 97.00 Why does the first process show the processor as 86% idle, and the second show it as ~97% idle? I've tried this in a bunch of different configurations, and it's not a real difference in CPU usage -- unless mpstat itself is making the difference. Which number should I trust?

Read the article

Apache's htcacheclean doesn't scale: How to tame a huge Apache disk_cache?

- by flight

We have an Apache setup with a huge disk_cache (500.000 entries, 50 GB disk space used). The cache grows by 16 GB every day. My problem is that the cache seems to be growing nearly as fast as it's possible to remove files and directories from the cache filesystem! The cache partition is an ext3 filesystem (100GB, "-t news") on an iSCSI storage. The Apache server (which acts as a caching proxy) is a VM. The disk_cache is configured with CacheDirLevels=2 and CacheDirLength=1, and includes variants. A typical file path is "/htcache/B/x/i_iGfmmHhxJRheg8NHcQ.header.vary/A/W/oGX3MAV3q0bWl30YmA_A.header". When I try to call htcacheclean to tame the cache (non-daemon mode, "htcacheclean-t -p/htcache -l15G"), IOwait is going through the roof for several hours. Without any visible action. Only after hours, htcacheclean starts to delete files from the cache partition, which takes a couple more hours. (A similar problem was brought up in the Apache mailing list in 2009, without a solution: http://www.mail-archive.com/[email protected]/msg42683.html) The high IOwait leads to problems with the stability of the web server (the bridge to the Tomcat backend server sometimes stalls). I came up with my own prune script, which removes files and directories from random subdirectories of the cache. Only to find that the deletion rate of the script is just slightly higher than the cache growth rate. The script takes ~10 seconds to read the a subdirectory (e.g. /htcache/B/x) and frees some 5 MB of disk space. In this 10 seconds, the cache has grown by another 2 MB. As with htcacheclean, IOwait goes up to 25% when running the prune script continuously. Any idea? Is this a problem specific to the (rather slow) iSCSI storage? Should I choose a different file system for a huge disk_cache? ext2? ext4? Are there any kernel parameter optimizations for this kind of scenario? (I already tried the deadline scheduler and a smaller read_ahead_kb, without effect).

Read the article

No apparent reason for high load average

- by Oz.

We have several web servers running on Amazon (ec2) c1.xlarge, over Amazon AMI. The servers are duplicates of each other, running the exact same hardware and software. Each server spec is: 7 GB of memory 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each) 1690 GB of instance storage 64-bit platform I/O Performance: High API name: c1.xlarge A couple of weeks ago we have run a yum upgrade on one of the servers. Starting on this upgrade the upgraded server started showing a high load average. Needless to say, we did not update the other servers and we can not do so until we understand the reason for this behavior. The strange thing is that when we compare the servers using top or iostat, we can not find the reason for the high load. Note that we have moved traffic from the "problematic" server to the others, which have made the "problematic" server less crowded in terms of requests, and still his load is higher. Do you have any idea what could it be, or where else can we check? Many thanks for the help! Oz. # # proper server # w command # 00:42:26 up 2 days, 19:54, 2 users, load average: 0.41, 0.48, 0.49 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT pts/1 82.80.137.29 00:28 14:05 0.01s 0.01s -bash pts/2 82.80.137.29 00:38 0.00s 0.02s 0.00s w # # proper server # iostat command # Linux 3.2.12-3.2.4.amzn1.x86_64 _x86_64_ (8 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 9.03 0.02 4.26 0.17 0.13 86.39 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn xvdap1 1.63 1.50 55.00 367236 13444008 xvdfp1 4.41 45.93 70.48 11227226 17228552 xvdfp2 2.61 2.01 59.81 491890 14620104 xvdfp3 8.16 14.47 94.23 3536522 23034376 xvdfp4 0.98 0.79 45.86 192818 11209784 # # problematic server # w command # 00:43:26 up 2 days, 21:52, 2 users, load average: 1.35, 1.10, 1.17 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT pts/0 82.80.137.29 00:28 15:04 0.02s 0.02s -bash pts/1 82.80.137.29 00:38 0.00s 0.05s 0.00s w # # problematic server # iostat command # Linux 3.2.20-1.29.6.amzn1.x86_64 _x86_64_ (8 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 7.97 0.04 3.43 0.19 0.07 88.30 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn xvdap1 2.10 1.49 76.54 374660 19253592 xvdfp1 5.64 40.98 85.92 10308946 21612112 xvdfp2 3.97 4.32 93.18 1087090 23439488 xvdfp3 10.87 30.30 115.14 7622474 28961720 xvdfp4 1.12 0.28 65.54 71034 16487112

Read the article

"iostat" command different in two equal machines

- by Oz.

We have several machines on Amazon (ec2) of the type c1.xlarge with 8 cpus, running the Amazon AMI. Details on the machine: 7 GB of memory 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each) 1690 GB of instance storage 64-bit platform I/O Performance: High API name: c1.xlarge One out of the several machines is showing a high load average, since we have run the last yum upgrade a couple of weeks a go. We did not yet update the other machines, and everything looks normal on them. The strange thing is that the top command not showing any hint for the cause of the load. CPUs are - 4.8%us, 1.1%sy, 0.0%ni, 94.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st. Mem is about 1.5GB free. Any idea what could it be, or where else can we check? iostat command on the proper machine: avg-cpu: %user %nice %system %iowait %steal %idle 8.97 0.03 4.46 0.19 0.14 86.23 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn xvdap1 1.60 0.69 55.38 587620 47254184 xvdfp2 2.64 1.10 61.04 934786 52091056 xvdfp4 0.86 0.19 41.72 163866 35601920 xvdfp1 4.37 36.59 73.89 31220810 63051504 xvdfp3 8.03 7.08 94.63 6045402 80749184 iostat command on problematic machine: avg-cpu: %user %nice %system %iowait %steal %idle 9.29 0.04 5.55 0.26 0.11 84.74 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn xvdap1 2.13 3.34 68.85 246244 5077888 xvdfp1 7.60 74.31 104.88 5480362 7734840 xvdfp3 13.22 73.67 125.00 5433386 9218600 xvdfp4 1.11 0.76 65.08 55762 4799248 xvdfp2 4.16 3.31 99.17 243818 7313264 Many thanks for the help.

Read the article

task blocked for more than

- by Manuel Sopena Ballesteros

I have a webserver with the configuration below: VMWare ESXi environemt CPanel installed CentOS release 6.5 (Final) 4 CPUs 2G RAM 2x VM disks 100G each LVM system My issue is I am getting kernel panic quite frequently. These is a list of some processes blocked I could see from the console: mysqld queueprocd httpd suphp vmtoolsd loop0 auditd this is my sar logs Linux 2.6.32-431.3.1.el6.x86_64 (test01) 08/22/2014 _x86_64_ (4 CPU) 12:00:01 AM CPU %user %nice %system %iowait %steal %idle 12:10:01 AM all 26.86 0.01 0.98 0.57 0.00 71.57 12:20:01 AM all 1.78 0.02 1.03 0.08 0.00 97.09 12:30:01 AM all 26.34 0.02 0.85 0.05 0.00 72.74 12:40:01 AM all 27.12 0.01 1.11 1.22 0.00 70.54 12:50:01 AM all 1.59 0.02 0.94 0.13 0.00 97.32 01:00:01 AM all 26.10 0.01 0.77 0.04 0.00 73.07 01:10:01 AM all 27.51 0.01 1.16 0.14 0.00 71.18 01:20:01 AM all 1.80 0.07 1.06 0.08 0.00 96.99 01:30:01 AM all 26.19 0.01 0.78 0.05 0.00 72.96 01:40:01 AM all 26.62 0.02 0.87 0.05 0.00 72.45 01:50:02 AM all 1.35 0.01 0.87 0.02 0.00 97.75 02:00:01 AM all 26.11 0.02 0.69 0.02 0.00 73.17 02:10:01 AM all 26.73 0.02 0.89 0.14 0.00 72.21 02:20:01 AM all 1.45 0.01 0.92 0.04 0.00 97.58 02:30:01 AM all 26.59 0.01 1.06 0.03 0.00 72.31 02:40:01 AM all 26.27 0.01 0.72 0.05 0.00 72.95 02:50:01 AM all 0.86 0.01 0.50 0.09 0.00 98.53 03:00:01 AM all 25.61 0.02 0.39 0.03 0.00 73.96 03:10:01 AM all 26.30 0.08 0.66 0.14 0.00 72.82 03:20:01 AM all 0.81 0.01 0.51 0.04 0.00 98.63 03:30:02 AM all 26.15 0.02 0.53 0.07 0.00 73.24 03:40:01 AM all 26.06 0.01 0.47 0.04 0.00 73.42 03:50:01 AM all 0.96 0.02 0.51 0.03 0.00 98.48 Average: all 17.69 0.02 0.79 0.14 0.00 81.36 06:58:14 AM LINUX RESTART 07:00:01 AM CPU %user %nice %system %iowait %steal %idle 07:10:01 AM all 1.04 0.02 0.57 0.95 0.00 97.42 07:20:02 AM all 0.66 0.01 0.39 0.06 0.00 98.87 07:30:01 AM all 25.71 0.01 0.45 0.16 0.00 73.67 07:40:01 AM all 25.88 0.01 0.35 0.08 0.00 73.68 As you can see the server became unresponsive at 03.50 AM and I had to reset the VM at 06.58 AM to fix it. dmesg does not show any relevant information. I don't see any bottleneck in sar, any idea what can I check next? thank you very much

Read the article

kernel panic after LVM setup

- by Manuel Sopena Ballesteros

I broke my webserver... My setup is: VMWare ESXi environemt CPanel installed CentOS release 6.5 (Final) 4 CPUs 2G RAM 2x VM disks 100G each LVM system This was my previous storage settings (the server was working fine at this time): # df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg_test01-lv_root 95G 1.4G 88G 2% / tmpfs 939M 0 939M 0% /dev/shm /dev/sdb1 99G 188M 94G 1% /tmp /dev/sda1 485M 54M 407M 12% /boot My web developer asked me to merge /tmp and / disks so this is what I did: Delete /dev/sdb1 partition using fdisk Create a new partition as LVM on /dev/sdb1 using fdisk Create a new physical volume -- pvcreate /dev/sdb1 Extend volume group -- vgextend /dev/sdb1 vg_test01 Extend logical volume -- lvextend -l +100%FREE /dev/vg_test01/lv_root Resize filesystem -- resize2fs /dev/vg_test01/lv_root This is the new configuration: # df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg_test01-lv_root 213G 105G 97G 52% / tmpfs 939M 0 939M 0% /dev/shm /dev/sda1 485M 54M 407M 12% /boot /usr/tmpDSK 4.0G 145M 3.6G 4% /tmp Since I have the new settings my web server is throwing kernel panics quite often (around every 2 days). The message says: INFO: task <taskName>:<pid> blocked for more than 120 seconds. The list of process affected that I can see from the console are: mysqld queueprocd httpd suphp vmtoolsd loop0 auditd The only way I can fix this is reseting (cold reboot) the VM. I don't think it is a hardware issue as sar is not showing any bottleneck: Linux 2.6.32-431.3.1.el6.x86_64 (test01) 08/22/2014 _x86_64_ (4 CPU) 12:00:01 AM CPU %user %nice %system %iowait %steal %idle 12:10:01 AM all 26.86 0.01 0.98 0.57 0.00 71.57 12:20:01 AM all 1.78 0.02 1.03 0.08 0.00 97.09 12:30:01 AM all 26.34 0.02 0.85 0.05 0.00 72.74 12:40:01 AM all 27.12 0.01 1.11 1.22 0.00 70.54 12:50:01 AM all 1.59 0.02 0.94 0.13 0.00 97.32 01:00:01 AM all 26.10 0.01 0.77 0.04 0.00 73.07 01:10:01 AM all 27.51 0.01 1.16 0.14 0.00 71.18 01:20:01 AM all 1.80 0.07 1.06 0.08 0.00 96.99 01:30:01 AM all 26.19 0.01 0.78 0.05 0.00 72.96 01:40:01 AM all 26.62 0.02 0.87 0.05 0.00 72.45 01:50:02 AM all 1.35 0.01 0.87 0.02 0.00 97.75 02:00:01 AM all 26.11 0.02 0.69 0.02 0.00 73.17 02:10:01 AM all 26.73 0.02 0.89 0.14 0.00 72.21 02:20:01 AM all 1.45 0.01 0.92 0.04 0.00 97.58 02:30:01 AM all 26.59 0.01 1.06 0.03 0.00 72.31 02:40:01 AM all 26.27 0.01 0.72 0.05 0.00 72.95 02:50:01 AM all 0.86 0.01 0.50 0.09 0.00 98.53 03:00:01 AM all 25.61 0.02 0.39 0.03 0.00 73.96 03:10:01 AM all 26.30 0.08 0.66 0.14 0.00 72.82 03:20:01 AM all 0.81 0.01 0.51 0.04 0.00 98.63 03:30:02 AM all 26.15 0.02 0.53 0.07 0.00 73.24 03:40:01 AM all 26.06 0.01 0.47 0.04 0.00 73.42 03:50:01 AM all 0.96 0.02 0.51 0.03 0.00 98.48 Average: all 17.69 0.02 0.79 0.14 0.00 81.36 06:58:14 AM LINUX RESTART 07:00:01 AM CPU %user %nice %system %iowait %steal %idle 07:10:01 AM all 1.04 0.02 0.57 0.95 0.00 97.42 07:20:02 AM all 0.66 0.01 0.39 0.06 0.00 98.87 07:30:01 AM all 25.71 0.01 0.45 0.16 0.00 73.67 07:40:01 AM all 25.88 0.01 0.35 0.08 0.00 73.68 07:50:01 AM all 1.13 0.02 0.55 0.11 0.00 98.19 As you can see the server became unresponsive at 03.50 AM and I had to reset the VM at 06.58 AM to bring the website up again. I would appreciate any help/assistance to fix this issue. thank you very much

Search Results

Search found 67 results on 3 pages for 'iowait'.

Page 1/3 | 1 2 3 | Next Page >

- by alfish

- by netom

- by Silvia

- by Dan

- by w00t

- by user64205

- by mq44944

- by James

- by user2008937

- by murdoch

- by James

- by Chris

- by John Stumbles

- by pQd

- by Mark F

- by Mark F

- by jack

- by Tim Bolton

- by Twirrim

- by Abe

- by flight

- by Oz.

- by Oz.

- by Manuel Sopena Ballesteros

- by Manuel Sopena Ballesteros

1 2 3 | Next Page >