What happens to missed writes after a zpool clear?

Posted by Kevin on Server Fault on 2013-06-19
Filed under: zfs | software-raid
I am trying to understand ZFS's behaviour under a specific condition, but the documentation is not very explicit about it, so I'm left guessing.

Suppose we have a zpool with redundancy. Take the following sequence of events:

  1. A problem arises in the connection between device D and the server. This causes a large number of I/O failures, so ZFS faults the device and puts the pool into a degraded state.

  2. While the pool is in this degraded state, it is mutated (data is written and/or changed).

  3. The connectivity issue is physically repaired such that device D is reliable again.

  4. Knowing that most of the data on D is still valid, and not wanting to stress the pool with a needless resilver, the admin instead runs zpool clear pool D. Oracle's documentation indicates this as the appropriate action when the fault was caused by a transient problem that has since been corrected.
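For concreteness, here is roughly the command sequence I have in mind (pool and D are placeholder names, and the comments describe what I would expect to see rather than actual output):

   # while the link to D is broken, zpool status shows D as FAULTED
   # and the pool as DEGRADED
   zpool status -v pool

   # after the transient problem is physically fixed, clear the fault
   # on that one device rather than replacing or resilvering it
   zpool clear pool D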

I've read that zpool clear only clears the error counters and restores the device to online status. However, this is a bit troubling, because if that's all it does, it will leave the pool in an inconsistent state!

This is because the mutations from step 2 will never have been written to D. Instead, D will reflect the state of the pool prior to the connectivity failure. That is, of course, not a consistent state for a zpool, and it could lead to permanent data loss if another device fails; however, the pool status will not reflect this issue!

I would at least assume, based on ZFS's robust integrity mechanisms (every block is checksummed), that an attempt to read the mutated data from D would catch the stale blocks and repair them from the surviving redundancy. However, this raises two problems:

  1. Reads are not guaranteed to hit all of the mutated data unless a scrub is done (the relevant commands are sketched after this list); and

  2. Once ZFS does hit the mutated data, it (I'm guessing) might fault the drive again, because D would appear to be returning corrupt data and ZFS does not remember the earlier write failures.
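The only way I know of to force every allocated block to be read and verified is an explicit scrub, e.g.:

   # walk every allocated block, verify checksums, and repair any bad
   # copies from the pool's redundancy
   zpool scrub pool

   # watch scrub progress and any checksum errors it finds and repairs
   zpool status -v pool

But a full scrub is close to the whole-pool workload I was hoping to avoid by using zpool clear in the first place.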

Theoretically, ZFS could circumvent this problem by keeping track of the mutations that occur while the pool is degraded and writing them back to D once it is cleared. For some reason I suspect that's not what happens, though.

I'm hoping someone with intimate knowledge of ZFS can shed some light on this aspect.
