I have a 45-disk array of Seagate Barracuda 3 TB ST3000DM001 drives (yes, these are desktop drives; I'm aware of that) in a Supermicro SC847 JBOD, connected via an LSI 9285. I have found a solution for the problem described below by reducing the link speed via
MegaCli -PhySetLinkSpeed -phy0 2 -a0;
for i in $(seq 48); do MegaCli -PhySetLinkSpeed -phy${i} 2 -a0; done
and rebooting.
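To confirm after the reboot that the lower speed actually took effect, the negotiated speeds can be tallied from the controller's physical-drive listing. This is only a sketch: it assumes that `MegaCli -PDList -aALL` prints one `Link Speed: ...` line per drive, and the helper name `link_speed_summary` is made up for illustration.

```shell
#!/bin/sh
# Count how many drives report each negotiated link speed.
# Reads a physical-drive listing on stdin and prints "<count> <speed>".
# Assumption: each drive contributes one line starting with "Link Speed".
link_speed_summary() {
    awk -F': *' '
        /^Link Speed/ {
            v = $2
            sub(/[ \t\r]+$/, "", v)   # drop trailing whitespace
            n[v]++
        }
        END { for (s in n) print n[s], s }
    ' | sort -k1,1nr -k2,2
}

# Hypothetical usage (adapter selection is an assumption):
# MegaCli -PDList -aALL | link_speed_summary
```

If every drive negotiated the reduced speed, the summary should contain a single line.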
The question remains: Is this typical for current 6 Gb/s equipment? Is this the sad state of SATA storage? Or is some of my equipment (the SFF-8088 cables come to mind) bad?
The problem was:
While synchronizing a hardware RAID-6, disks kept going offline. Fetching SMART values revealed that the disks which had gone offline no longer increased their powered-on hours. That is, their firmware (CC4C) seems to crash.
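The powered-on-hours check can be scripted roughly as follows. This is a sketch, not the exact procedure used here: it assumes the usual `smartctl -A` attribute table, where attribute 9 is `Power_On_Hours` and the raw value is the tenth column, and the helper name `power_on_hours` is hypothetical.

```shell
#!/bin/sh
# Print the raw Power_On_Hours value (SMART attribute 9) from
# `smartctl -A` output supplied on stdin.
# Assumption: standard attribute table layout, raw value in column 10.
power_on_hours() {
    awk '$1 == 9 && $2 == "Power_On_Hours" { print $10; exit }'
}

# Hypothetical usage: sample every disk twice, some hours apart, and
# compare the numbers. Plain /dev/sdX access is an assumption; behind a
# MegaRAID controller smartctl typically needs "-d megaraid,N" instead.
# for d in /dev/sd*[a-z]; do
#     printf '%s %s\n' "$d" "$(smartctl -A "$d" | power_on_hours)"
# done
```

A disk whose value is identical in both samples while its neighbours have advanced is a candidate for a hung firmware.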
Digging into the matter by switching to software RAID-6 with the disks passed through, I got tons of kernel messages scattered across all disks at 6 Gb/s:
sd 0:0:9:0: [sdb] Sense Key : No Sense [current]
Info fld=0x0
sd 0:0:9:0: [sdb] Add. Sense: No additional sense information
And finally, when a disk offlines:
megasas: [ 5]waiting for 160 commands to complete
...
megasas: [35]waiting for 159 commands to complete
...
megasas: [155]waiting for 156 commands to complete
...
megaraid_sas: pending commands remain after waiting, will reset adapter.
Ugly controller reset here, then minutes later:
megaraid_sas: Reset successful.
sd 0:0:28:0: Device offlined - not ready after error recovery
...
sd 0:0:28:0: [sdu] Unhandled error code
sd 0:0:28:0: [sdu] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
sd 0:0:28:0: [sdu] CDB: Read(10): 28 00 23 21 2f 40 00 00 70 00
sd 0:0:28:0: [sdu] killing request
After I reduced the link speed to 3 Gb/s as described above, all problems vanished.
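To compare the two link speeds more objectively, the error messages quoted above can be tallied per SCSI address from the kernel log. A minimal sketch follows; the helper name `count_sd_errors` is hypothetical, and it counts only the three message types shown in this post ("Sense Key", "Unhandled error code", "Device offlined").

```shell
#!/bin/sh
# Tally sd error messages per SCSI address (host:channel:target:lun).
# Reads kernel log text on stdin and prints "<count> <address>",
# highest count first. An optional timestamp prefix is tolerated.
count_sd_errors() {
    awk '
        /Sense Key|Unhandled error code|Device offlined/ &&
        match($0, /sd [0-9]+:[0-9]+:[0-9]+:[0-9]+:/) {
            # Strip the leading "sd " and the trailing ":".
            addr = substr($0, RSTART + 3, RLENGTH - 4)
            n[addr]++
        }
        END { for (a in n) print n[a], a }
    ' | sort -k1,1nr -k2,2
}

# Hypothetical usage:
# dmesg | count_sd_errors
```

Errors spread evenly over all addresses would point at something shared, such as the cables, the expander, or the controller, while a few dominant addresses would point at individual disks.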