Hello all
I'm trying to optimize a storage setup on some Sun hardware with Linux. Any thoughts would be greatly appreciated.
We have the following hardware:
Sun Blade X6270
2* LSISAS1068E SAS controllers
2* Sun J4400 JBODs with 1 TB disks (24 disks per JBOD)
Fedora Core 12
2.6.33 release kernel from FC13 (also tried with latest 2.6.31 kernel from FC12, same results)
Here's the datasheet for the SAS hardware:
http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf
It's using PCI Express 1.0a, 8x lanes. With a bandwidth of 250 MB/sec per lane, we should be able to do 2000 MB/sec per SAS controller.
Each controller can do 3 Gb/sec per port and has two 4 port PHYs. We connect both PHYs from a controller to a JBOD. So between the JBOD and the controller we have 2 PHYs * 4 SAS ports * 3 Gb/sec = 24 Gb/sec of bandwidth, which is more than the PCI Express bandwidth.
With write caching enabled and when doing big writes, each disk can sustain about 80 MB/sec (near the start of the disk). With 24 disks, that means we should be able to do 1920 MB/sec per JBOD.
multipath {
rr_min_io 100
uid 0
path_grouping_policy multibus
failback manual
path_selector "round-robin 0"
rr_weight priorities
alias somealias
no_path_retry queue
mode 0644
gid 0
wwid somewwid
}
I tried values of 50, 100, 1000 for rr_min_io, but it doesn't seem to make much difference.
Along with varying rr_min_io I tried adding some delay between starting the dd's to prevent all of them writing over the same PHY at the same time, but this didn't make any difference, so I think the I/O's are getting properly spread out.
According to /proc/interrupts, the SAS controllers are using a "IR-IO-APIC-fasteoi" interrupt scheme. For some reason only core #0 in the machine is handling these interrupts. I can improve performance slightly by assigning a separate core to handle the interrupts for each SAS controller:
echo 2 /proc/irq/24/smp_affinity
echo 4 /proc/irq/26/smp_affinity
Using dd to write to the disk generates "Function call interrupts" (no idea what these are), which are handled by core #4, so I keep other processes off this core too.
I run 48 dd's (one for each disk), assigning them to cores not dealing with interrupts like so:
taskset -c somecore dd if=/dev/zero of=/dev/mapper/mpathx oflag=direct bs=128M
oflag=direct prevents any kind of buffer cache from getting involved.
None of my cores seem maxed out. The cores dealing with interrupts are mostly idle and all the other cores are waiting on I/O as one would expect.
Cpu0 : 0.0%us, 1.0%sy, 0.0%ni, 91.2%id, 7.5%wa, 0.0%hi, 0.2%si, 0.0%st
Cpu1 : 0.0%us, 0.8%sy, 0.0%ni, 93.0%id, 0.2%wa, 0.0%hi, 6.0%si, 0.0%st
Cpu2 : 0.0%us, 0.6%sy, 0.0%ni, 94.4%id, 0.1%wa, 0.0%hi, 4.8%si, 0.0%st
Cpu3 : 0.0%us, 7.5%sy, 0.0%ni, 36.3%id, 56.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 1.3%sy, 0.0%ni, 85.7%id, 4.9%wa, 0.0%hi, 8.1%si, 0.0%st
Cpu5 : 0.1%us, 5.5%sy, 0.0%ni, 36.2%id, 58.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 5.0%sy, 0.0%ni, 36.3%id, 58.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 5.1%sy, 0.0%ni, 36.3%id, 58.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.1%us, 8.3%sy, 0.0%ni, 27.2%id, 64.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.1%us, 7.9%sy, 0.0%ni, 36.2%id, 55.8%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 7.8%sy, 0.0%ni, 36.2%id, 56.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 7.3%sy, 0.0%ni, 36.3%id, 56.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 5.6%sy, 0.0%ni, 33.1%id, 61.2%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.1%us, 5.3%sy, 0.0%ni, 36.1%id, 58.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 4.9%sy, 0.0%ni, 36.4%id, 58.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.1%us, 5.4%sy, 0.0%ni, 36.5%id, 58.1%wa, 0.0%hi, 0.0%si, 0.0%st
Given all this, the throughput reported by running "dstat 10" is in the range of 2200-2300 MB/sec.
Given the math above I would expect something in the range of 2*1920 ~= 3600+ MB/sec.
Does anybody have any idea where my missing bandwidth went?
Thanks!