This is similar to 3 drives fell out of Raid6 mdadm - rebuilding? except that it is not due to a failing cable. Instead, the 3rd drive fell offline during the rebuild of another drive.
The drive failed with:
kernel: end_request: I/O error, dev sdc, sector 293732432
kernel: md/raid:md0: read error not correctable (sector 293734224 on sdc).
After rebooting, both these sectors and the sectors around them read fine. This leads me to believe the error is intermittent, and that the drive simply took too long to error-correct the sector and remap it.
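For reference, this is roughly how I re-checked the sectors after the reboot (assuming 512-byte logical sectors; the sector numbers are the ones from the log above):

# Re-read the two reported sectors plus a window around them
# (assumption: 512-byte logical sectors, numbers taken from the kernel log)
for s in 293732432 293734224; do
  dd if=/dev/sdc of=/dev/null bs=512 skip=$((s-1024)) count=2048 iflag=direct
done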
I expect that no data was written to the RAID after it failed. Therefore I hope that if I can kick the last failed device back online, the RAID will be fine and the XFS filesystem will be OK, maybe with a few recent files missing.
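To get an idea of how far behind the dropped drive is, I compared the event counters of the members (the device names below are assumptions; substitute the real array members):

# A small difference in the Events counter suggests little or nothing was written after the failure
# (assumption: the members are partitions named /dev/sd[bcdef]1)
mdadm --examine /dev/sd[bcdef]1 | egrep '/dev/|Update Time|Events'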
Taking a backup of the disks in the RAID takes 24 hours, so I would prefer that the solution works the first time.
I have therefore set up a test scenario:
export PRE=3
parallel dd if=/dev/zero of=/tmp/raid${PRE}{} bs=1k count=1000k ::: 1 2 3 4 5
parallel mknod /dev/loop${PRE}{} b 7 ${PRE}{} \; losetup /dev/loop${PRE}{} /tmp/raid${PRE}{} ::: 1 2 3 4 5
mdadm --create /dev/md$PRE -c 4096 --level=6 --raid-devices=5 /dev/loop${PRE}[12345]
cat /proc/mdstat
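# (assumption: letting the initial resync finish first makes the failure injection below closer to the real situation)
mdadm --wait /dev/md$PRE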
mkfs.xfs -f /dev/md$PRE
mkdir -p /mnt/disk2
umount -l /mnt/disk2
mount /dev/md$PRE /mnt/disk2
seq 1000 | parallel -j1 mkdir -p /mnt/disk2/{}\;cp /bin/* /mnt/disk2/{}\;sleep 0.5 &
mdadm --fail /dev/md$PRE /dev/loop${PRE}3 /dev/loop${PRE}4
cat /proc/mdstat
# Assume reboot so no process is using the dir
kill %1; sync &
kill %1; sync &
# Force fail one too many
mdadm --fail /dev/md$PRE /dev/loop${PRE}1
parallel --tag -k mdadm -E ::: /dev/loop${PRE}? | grep Upda
# loop 2,5 are newest. loop1 almost newest => force add loop1
The next step is to add loop1 back, and this is where I am stuck.
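What I have been considering is to stop the array and re-assemble it with --force from the three members with the newest update times, but I am not sure this is the right approach or whether it risks the data:

# Stop the degraded array and force-assemble it from the 3 freshest members (loop1 is only slightly stale)
# (assumption: --assemble --force will bump loop1's event count and let the array start degraded)
mdadm --stop /dev/md$PRE
mdadm --assemble --force /dev/md$PRE /dev/loop${PRE}1 /dev/loop${PRE}2 /dev/loop${PRE}5
cat /proc/mdstat

If that brings the array up (degraded), loop3 and loop4 could presumably be added back afterwards with mdadm --add to rebuild.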
After that, do an XFS consistency check.
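Something along these lines, assuming xfs_repair's no-modify mode (-n) is enough for a read-only consistency check:

# xfs_repair must be run on an unmounted filesystem; -n only reports problems, it does not modify anything
xfs_repair -n /dev/md$PRE
# Then mount read-only and spot-check the most recently written directories
mount -o ro /dev/md$PRE /mnt/disk2
ls /mnt/disk2 | sort -n | tail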
When that works, check that the solution also works on real devices (such as 4 USB sticks).