I am bringing to ServerFault a problem that has been tormenting me for 6+ months. I have a CentOS 6 (64-bit) server with an md software RAID-1 array of 2 x Samsung 840 Pro SSDs (512 GB).
Problems:
Serious write speed problems:
root [~]# time dd if=arch.tar.gz of=test4 bs=2M oflag=sync
146+1 records in
146+1 records out
307191761 bytes (307 MB) copied, 23.6788 s, 13.0 MB/s
real 0m23.680s
user 0m0.000s
sys 0m0.932s
When doing the above (or any other larger copy), the load spikes to unbelievable values (even over 100), up from around 1.
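(For reference, this can be watched from a second shell while the dd runs; a minimal sketch using vmstat:)
vmstat 1 5        # 5 one-second samples; watch the 'b' (blocked processes) and 'wa' (iowait) columns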
When doing the above I've also noticed very weird iostat results:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1589.50 0.00 54.00 0.00 13148.00 243.48 0.60 11.17 0.46 2.50
sdb 0.00 1627.50 0.00 16.50 0.00 9524.00 577.21 144.25 1439.33 60.61 100.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 1602.00 0.00 12816.00 8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
And it stays this way until the file is actually written out to the device (flushed from swap/cache/memory).
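(For reference, per-device extended statistics like the table above come from iostat's extended mode; a sketch, assuming sysstat is installed:)
iostat -x 1       # -x adds await/svctm/%util; the trailing 1 repeats the report every second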
The problem is that the second SSD in the array has svctm and await roughly 100 times larger than the first.
For some reason the wear is also different between the 2 members of the array:
root [~]# smartctl --attributes /dev/sda | grep -i wear
177 Wear_Leveling_Count 0x0013 094% 094 000 Pre-fail Always - 180
root [~]# smartctl --attributes /dev/sdb | grep -i wear
177 Wear_Leveling_Count 0x0013 070% 070 000 Pre-fail Always - 1005
The first SSD has a wear of 6% while the second has a wear of 30% (100 minus the normalized values of 94 and 70, respectively); the raw erase counts, 180 vs 1005, differ by more than a factor of 5!
It's as if the second SSD in the array works at least 5 times as hard as the first one, which also shows in the first iteration of iostat (the averages since reboot):
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 10.44 51.06 790.39 125.41 8803.98 1633.11 11.40 0.33 0.37 0.06 5.64
sdb 9.53 58.35 322.37 118.11 4835.59 1633.11 14.69 0.33 0.76 0.29 12.97
md1 0.00 0.00 1.88 1.33 15.07 10.68 8.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 1109.02 173.12 10881.59 1620.39 9.75 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.41 0.01 3.10 0.02 7.42 0.00 0.00 0.00 0.00
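(A more direct way to compare how much each member actually writes would be lifetime host writes; a sketch, assuming the 840 Pro exposes SMART attribute 241 Total_LBAs_Written:)
smartctl --attributes /dev/sda | grep -i total_lbas
smartctl --attributes /dev/sdb | grep -i total_lbas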
What I've tried:
I've updated the firmware to DXM05B0Q (following reports of dramatic improvements for 840 Pros after this update).
I have looked for "hard resetting link" in dmesg to check for cable/backplane issues, but found nothing (the check is sketched right after this list).
I have checked the alignment and I believe the partitions are aligned correctly (1 MiB boundary; fdisk listings below, plus a quick sysfs cross-check after them).
I have checked /proc/mdstat and the array is Optimal (second listing below).
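(The dmesg check mentioned above boils down to something like this:)
dmesg | grep -i "hard resetting link"     # link resets here would point to a cable/backplane problem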
root [~]# fdisk -ul /dev/sda
Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00026d59
Device Boot Start End Blocks Id System
/dev/sda1 2048 4196351 2097152 fd Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2 * 4196352 4605951 204800 fd Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sda3 4605952 814106623 404750336 fd Linux raid autodetect
root [~]# fdisk -ul /dev/sdb
Disk /dev/sdb: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0003dede
Device Boot Start End Blocks Id System
/dev/sdb1 2048 4196351 2097152 fd Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2 * 4196352 4605951 204800 fd Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sdb3 4605952 814106623 404750336 fd Linux raid autodetect
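(As a cross-check of the 1 MiB alignment, the start sectors can also be read from sysfs; a minimal sketch, where a start that is a multiple of 2048 sectors sits on a 1 MiB boundary because 2048 * 512 bytes = 1 MiB:)
for p in /sys/block/sd[ab]/sd[ab][0-9]*/start; do
    printf '%s: %s (mod 2048 = %s)\n' "$p" "$(cat $p)" "$(( $(cat $p) % 2048 ))"
done
(A remainder of 0 for every partition means they all start on a 1 MiB boundary.)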
/proc/mdstat
root # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb2[1] sda2[0]
204736 blocks super 1.0 [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
404750144 blocks super 1.0 [2/2] [UU]
md1 : active raid1 sdb1[1] sda1[0]
2096064 blocks super 1.1 [2/2] [UU]
unused devices: <none>
Running a read test with hdparm
root [~]# hdparm -t /dev/sda
/dev/sda:
Timing buffered disk reads: 664 MB in 3.00 seconds = 221.33 MB/sec
root [~]# hdparm -t /dev/sdb
/dev/sdb:
Timing buffered disk reads: 288 MB in 3.01 seconds = 95.77 MB/sec
But look what happens if I add --direct:
root [~]# hdparm --direct -t /dev/sda
/dev/sda:
Timing O_DIRECT disk reads: 788 MB in 3.01 seconds = 262.08 MB/sec
root [~]# hdparm --direct -t /dev/sdb
/dev/sdb:
Timing O_DIRECT disk reads: 534 MB in 3.02 seconds = 176.90 MB/sec
Both results increase with --direct, but /dev/sdb nearly doubles (from 95.77 to 176.90 MB/s) while /dev/sda gains maybe 20% (from 221.33 to 262.08 MB/s). I just don't know what to make of this.
As suggested by Mr. Wagner, I've done another read test, with dd this time, and it confirms the hdparm results:
root [/home2]# dd if=/dev/sda of=/dev/null bs=1G count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 38.0855 s, 282 MB/s
root [/home2]# dd if=/dev/sdb of=/dev/null bs=1G count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 115.24 s, 93.2 MB/s
So sda is 3 times faster than sdb. Or maybe sdb is also doing something else besides what sda does. Is there some way to find out whether sdb is doing more work than sda?
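(One thing I can think of: the kernel keeps cumulative per-device I/O counters in /sys/block/<dev>/stat, so the two members can be compared directly; a rough sketch, field 3 being sectors read and field 7 sectors written since boot:)
for d in sda sdb; do
    awk -v d=$d '{ printf "%s: %s sectors read, %s sectors written since boot\n", d, $3, $7 }' /sys/block/$d/stat
done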
UPDATE
Again, as suggested by Mr. Wagner, I have swapped the 2 SSDs. And, as he expected, the problem moved from sdb to sda. So I guess I'll RMA one of the SSDs. I still wonder whether the cage might be problematic.
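(To help rule the cage/backplane in or out, one more thing worth comparing is the negotiated link speed on each port; a sketch, grepping the boot messages:)
dmesg | grep -i "SATA link up"    # both ports should report the same speed, e.g. 6.0 Gbps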
What is wrong with this array? Please help!