Formula to calculate probability of unrecoverable read error during RAID rebuild
- by OlafM
I need to compare the reliability of different RAID systems with either consumer or enterprise drives. The formula to have the probability of success of a rebuild, ignoring mechanical problems, is simple:
error_probability = 1 - (1-per_bit_error_rate)^bit_read
and with 3 TB drives I get
38% probability to experience an URE (unrecoverable read error) for a 2+1 disks RAID5 (4.7% for enterprise drives)
21% for a RAID1 (2.4% for enterprise drives)
51% probability of error during recovery for the 3+1 RAID5 often used by users of SOHO products like Synologys. Most people don't know about this.
Calculating the error for single disk tolerance is easy, my question concerns systems tolerant to multiple disks failures (RAID6/Z2, RAIDZ3 and RAID1 with multiple disks).
If only the first disk is used for rebuild and the second one is read again from the beginning in case or an URE, then the error probability is the one calculated above squared (14.5% for consumer RAID5 2+1, 4.5% for consumer RAID1 1+2). However, I suppose (at least in ZFS that has full checksums!) that the second parity/available disk is read only where needed, meaning that only few sectors are needed: how many UREs can possibly happen in the first disk? not many, otherwise the error probability for single-disk tolerance systems would skyrocket even more than I calculated.
If I'm correct, a second parity disk would practically lower the risk to extremely low values.
Am I correct?