by kipari 2564 days ago
I reckon that the chance of the same two blocks on two different disks failing between ZFS scrubs would be incredibly small.

Assuming the corruption is independent, potentially, but A) even unlikely events are likely to happen for large enough N, and much more importantly, B) as another poster described, if you don't regularly check the integrity, and you have single-disk redundancy, losing a whole disk can likely result in you discovering a block that got mangled some time ago, too late to do anything about it.
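To put a rough number on point A, here is a minimal sketch, assuming each of N block pairs fails independently with the same probability p between scrubs. The function name and the values of p and N are hypothetical, picked only for illustration; real failure rates are not given anywhere in this thread.

```python
import math


def prob_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent trials hits an
    event of per-trial probability p, i.e. 1 - (1 - p)**n.

    log1p/expm1 keep precision when p is tiny and n is huge.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be within [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    if p == 1.0:
        return 1.0 if n > 0 else 0.0
    return -math.expm1(n * math.log1p(-p))


# Hypothetical numbers: a one-in-a-trillion coincidence per block pair,
# checked across a trillion block pairs.
print(prob_at_least_one(1e-12, 10**12))
```

With those made-up inputs the result is about 0.63, so an event that is vanishingly unlikely for any one block pair becomes more likely than not across the whole pool.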

There are a number of cases where failures might not be independent, though.

What if, say, you're using multiple drives of the same model, which have a firmware bug causing them to sometimes mangle data on the Nth sector?

What if you're using multiple drives from the same manufacturing batch which have a flaw leading to certain regions being more likely to fail than others?

What if you're using some battery-backed write cache under ZFS (from a HW RAID card or something more exotic), and it helpfully writes out garbage to the same sector on two disks?

What if you have a certain manufacturer's hard drives that lie about flushing their write cache successfully to disk if you issue a SMART request to them between when they put data in cache and when it actually gets to disk, so polling those two disks when they both just got a write results in data loss?

(The last of these is a real firmware bug I ran into - I was running a testbed of a bunch of raidz3 vdevs, and spent some time isolating when zpool scrub kept making the error counters increase even though it had corrected them all...thanks, Samsung HD204UI drives.)

It is incredibly small if you don't consider either drive failing outright. But if one drive fails, it happens with some regularity that a sector on the surviving drive turns out to be bad. Strictly speaking only one sector is bad, but because the dead drive held the only other copy, that data is effectively lost.

This comes up on the linux raid list with some frequency: a drive fails in a raid56 array, and the rebuild subsequently trips over a single bad sector on another drive.

But it's true that lack of scrubbing contributes to this scenario, as does the terrible combination of consumer drives with very long bad-sector recovery times and the Linux SCSI command timer default of 30 seconds. That combination masks bad sectors, so they never get repaired, and as a user you may not realize that the link resets are not normal and point to a bad sector as the cause.

Are you saying that a failure happens which isn’t detected and when the 2nd failure occurs we notice because the data is inaccessible?

Which raid s/w does this?

Correct. All of them that depend on the SCSI block layer, which includes libata and thus common consumer SATA drives. A NAS or better drive will come out of the box with a short error recovery timeout, typically 70 deciseconds (7 seconds), and will quickly return a read error with the LBA of the offending bad sector. The RAID then knows to obtain a copy or reconstruct from parity and write the good data back to the bad sector, thus fixing it. Either the write works, or if it fails the drive firmware is responsible for remapping that LBA to a reserve physical sector.
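For anyone unfamiliar with the "reconstruct from parity" step, here is a minimal sketch, assuming simple single-parity XOR in the style of RAID 5. It is not how md, Btrfs or raidz actually implement it, and the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(surviving_blocks, parity):
    """Rebuild the one unreadable data block of a stripe from the
    surviving data blocks plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# A stripe of three data blocks plus parity; pretend d1 hit a read error.
d0, d1, d2 = b"abcd", b"wxyz", b"1234"
parity = xor_blocks([d0, d1, d2])
rebuilt = reconstruct_missing([d0, d2], parity)
assert rebuilt == d1  # this is what gets written back over the bad sector
```

The important part is that the drive has to report which LBA failed. Without that read error the RAID layer never gets to run this step.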

In the case where the drive's error recovery timeout is longer than the SCSI block layer's command timer, the kernel gives up first and the result is just a link reset. The actual problem with the drive, including the bad sector, is obscured by the reset, so it never gets repaired.

Btrfs, mdadm, and lvm are affected, and I'm pretty sure ZFS on Linux is as well, assuming they haven't totally reimplemented their own block layer outside of the SCSI subsystem.

It's a super irritating problem. The kernel developers know all about it, but thus far it's considered something distributions should change for the use cases that need it. What that means so far is that distros don't change it, and users with consumer drives that have long error recovery times get bitten.
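As a rough illustration of the mismatch, here is a sketch that compares the two timers. It assumes you already know the drive's error recovery limit in deciseconds (`smartctl -l scterc /dev/sdX` reports it on drives that support SCT ERC) and that the kernel command timer is the value in `/sys/block/<dev>/device/timeout`. The function names are hypothetical, and treating a missing or zero ERC value as "no limit" is my assumption.

```python
from pathlib import Path
from typing import Optional


def read_scsi_timeout_seconds(device: str, sysfs_root: str = "/sys/block") -> int:
    """Read the kernel's SCSI command timer for a device such as 'sda'."""
    return int(Path(sysfs_root, device, "device", "timeout").read_text().strip())


def is_timeout_mismatch(erc_deciseconds: Optional[int],
                        scsi_timeout_seconds: int) -> bool:
    """True when the drive may keep retrying longer than the kernel waits.

    erc_deciseconds is the drive's error recovery limit in tenths of a
    second; None or 0 is treated here as 'disabled or unsupported',
    meaning the drive sets no upper bound on its own recovery time.
    """
    if not erc_deciseconds:
        return True
    return erc_deciseconds / 10 >= scsi_timeout_seconds


# NAS-style drive: 70 deciseconds = 7 s, well inside the 30 s default.
print(is_timeout_mismatch(70, 30))    # False
# Consumer drive with no error recovery limit: the kernel resets the link first.
print(is_timeout_mismatch(None, 30))  # True
```

Either shortening the drive's limit or raising the kernel timer clears the mismatch; the wiki page linked further down covers both.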

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

The link you posted talks about the raid software kicking a whole disk out of the array when the disk takes too long to respond, which comes down (basically, but not exactly) to a mismatch between two timeout settings.

The post I was responding to implied a raid array could be degraded and you wouldn't know until it completely failed.

Interesting nevertheless.

Yes, over normal timescales. A lot can happen in a thousand years.
Thousands of years is a lot of scrubs and a lot of disk replacements, though. And a solution like ZFS, properly monitored, should help make those detections and repairs happen early, with lower odds of loss.

Although honestly, on a thousand-year timeframe I very much doubt humanity will preserve ZFS, gzip, tar, JPEG, PNG, ASCII, today's spoken and written languages in current form, etc. Written material from 1000 years ago is already not very accessible to most people; with the original material you need intense study before you even know what you're looking at.