Why are SMART error rates going down?
- by Jeff Shattock
I have a hard drive that's part of a Linux software raid5 array. SMART has reported that its multi_zone_error_rate was 0, then 1, then 3. So I figured I better start backing up more frequently and prepare to replace the drive. Now, today, the multi_zone_error_rate of that very same drive is back down to 1. It seems that 2 errors unhappened while I wasn't looking.
I've also seen simliar behaviour by inspecting the syslog on the server.
Jun 7 21:01:17 FS1 smartd[25593]: Device: /dev/sdc, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 7 21:01:17 FS1 smartd[25593]: Device: /dev/sde, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 7 21:01:18 FS1 smartd[25593]: Device: /dev/sdg, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 8 02:31:18 FS1 smartd[25593]: Device: /dev/sdg, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Jun 8 03:01:17 FS1 smartd[25593]: Device: /dev/sdc, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Jun 8 03:01:17 FS1 smartd[25593]: Device: /dev/sde, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
These are raw values, not the human-useful values that smartctl -a produces, but the behaviour is similar: error rates changing, then undoing the change. None of these are the drive that had the multi_zone weirdness. I haven't seen any problems from the RAID; its most recent scrub ( < 24 hours ago) came back totally clean.
The only thing I can think of is that the SMART reporting circuitry on the drive isn't working properly all the time. The cables are in tight on the drive and board. What's going on here?