by Xamayon 1416 days ago
A lot of things in computing are overwhelmingly stupid or assume everything will work as expected. I have experienced several data corruption events related to parity data being read incorrectly, not in ZFS, but with hardware and software RAID controllers. In one case the hardware RAID controller even had ECC memory, but its memory was overheating and thus introducing bad data into calculations whenever multi-bit errors could not be corrected. A similarly horrific error condition saw a controller confuse disk IDs in memory and start mirroring one drive to every other drive in the system.

Those are not instances of error checking causing data corruption. As I said, "I don't understand how error checking could ever cause data corruption."

Error checking will only ever help you, not hurt you. It doesn’t matter how bad your memory or disk or RAID controller is. Error checking won't necessarily save you from those things, but it can in some cases, and it’ll never make things worse.

But they are, though. In that first example, the corrupted parity calculations caused data corruption during a scheduled array check while the system was under unusually heavy load. Error checking is good, and when things are working right it can only help. That is true, but it can't always be counted on if the hardware, software, etc. is untrustworthy for whatever reason.
Okay, well I am totally and utterly confused as to how that could ever be possible, regardless of the hardware. You're confident that if not for the data validation the problem wouldn't have occurred?
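
For anyone trying to picture how a scheduled array check could write bad data, here is a minimal sketch of one possible mechanism. It is a toy model, not a reconstruction of the controller described above: the three-block stripe, the scrub policy of trusting the data and rewriting parity on a mismatch, and the `flaky_xor_blocks` function standing in for a controller with bad memory are all assumptions made for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Correct parity: byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def flaky_xor_blocks(blocks):
    """Hypothetical faulty controller: flips one bit in the computed parity."""
    result = bytearray(xor_blocks(blocks))
    result[0] ^= 0x01  # simulated uncorrected memory error
    return bytes(result)

def scrub(data_blocks, stored_parity, parity_fn):
    """Recompute parity; on mismatch, trust the data and rewrite the parity.

    Returns (parity now on disk, whether a "repair" was written).
    This repair policy is an assumption; real controllers differ.
    """
    computed = parity_fn(data_blocks)
    if computed != stored_parity:
        return computed, True
    return stored_parity, False

def rebuild(surviving_blocks, parity):
    """Reconstruct a single missing block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])

# A healthy stripe: three data blocks plus correct parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Scrub with a working parity engine, then with the faulty one.
parity_after_good, rewrote_good = scrub(data, parity, xor_blocks)
parity_after_bad, rewrote_bad = scrub(data, parity, flaky_xor_blocks)

# Later, the disk holding data[1] fails and is rebuilt from parity.
rebuilt_good = rebuild([data[0], data[2]], parity_after_good)
rebuilt_bad = rebuild([data[0], data[2]], parity_after_bad)

print(rewrote_good, rebuilt_good == data[1])
print(rewrote_bad, rebuilt_bad == data[1])
```

In this model the comparison itself never harms anything. The damage comes from the write that follows a false mismatch: the faulty engine replaces good parity with bad parity, and the corruption only surfaces later when a block is rebuilt from it. A check that only reported mismatches, without writing, would leave the stripe intact.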