by rincebrain 2577 days ago

If the corruption is independent, potentially. But A) even unlikely events become likely for large enough N, and, much more importantly, B) as another poster described, if you don't regularly check integrity and you only have single-disk redundancy, losing a whole disk can easily mean discovering a block that got mangled some time ago, too late to do anything about it.
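To put rough numbers on point A, here is a minimal sketch of the "large enough N" arithmetic. It assumes each block is corrupted independently with the same probability; the 1e-6 figure is a made-up illustration, not a measured rate for any drive.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n independent events occurs,
    where each event has probability p (hypothetical, for illustration)."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 1e-6  # assumed per-block corruption probability, purely illustrative
    for n in (10**3, 10**6, 10**9):
        print(f"N={n}: P(at least one bad block) = {prob_at_least_one(p, n):.4f}")
```

Under these assumptions the chance of at least one bad block is well under 1% at a thousand blocks, but close to certain at a billion. Correlated failures like the ones below only make it worse.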

There are a number of cases where failures might not be independent, though.

What if, say, you're using multiple drives of the same model, which have a firmware bug causing them to sometimes mangle data on the Nth sector?

What if you're using multiple drives from the same manufacturing batch which have a flaw leading to certain regions being more likely to fail than others?

What if you're using some battery-backed write cache under ZFS (from a HW RAID card or something more exotic), and it helpfully writes out garbage to the same sector on two disks?

What if you have a certain manufacturer's hard drives that lie about flushing their write cache successfully to disk if you issue a SMART request to them between when they put data in cache and when it actually gets to disk, so polling those two disks when they both just got a write results in data loss?

(The last of these is a real firmware bug I ran into: I was running a testbed of a bunch of raidz3 vdevs, and spent some time isolating when zpool scrub kept making the error counters increase even though it had corrected them all... thanks, Samsung HD204UI drives.)