by Xamayon 1405 days ago
I did not mention ZFS specifically. If ZFS has better handling of this kind of thing, that's great, but if you can't trust your memory to be correct you can't trust the data in buffers, the data being hashed, or the data being read from or written out to disk. Additionally, you can't trust the filesystem to behave in the ways that it should. There are many kinds of memory errors: some may, for example, affect certain data sequences in a fairly deterministic way, some are completely random, and some can be triggered by users or attackers.

Unless the filesystem is behaving in a way that is overwhelmingly stupid, the basic logic should still apply. I don't understand how error checking could ever cause data corruption. It might let you know about data corruption which would otherwise have gone unnoticed, but that's not the same thing.

If there is a filesystem that is dumb enough to cause corruption during the checksumming process, please let me know which one, so I can be sure to never ever ever go anywhere near it. :)

A lot of things in computing are overwhelmingly stupid or assume everything will work as expected. I have experienced several data corruption events related to parity data being read incorrectly, not in ZFS, but with hardware and software RAID controllers. In one case the hardware RAID controller even had ECC memory, but its memory was overheating and thus introducing bad data into calculations when multi-bit errors were not correctable. A similarly horrific error condition saw a controller confuse disk IDs in memory and start mirroring one drive to every other drive in the system.
Those are not instances of error checking causing data corruption. As I said, "I don't understand how error checking could ever cause data corruption."

Error checking will only ever help you, not hurt you. It doesn’t matter how bad your memory or disk or RAID controller is. Error checking won't necessarily save you from those things, but it can in some cases, and it’ll never make things worse.
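
For illustration, here is a minimal sketch of what this comment seems to mean by error checking: a verification step that only reads the data and reports a mismatch, and never writes anything back. The `store` and `verify` helpers are hypothetical names, and CRC32 is only a stand-in for whatever checksum a real filesystem uses.

```python
import zlib

def store(block: bytes):
    """Return the block together with a checksum computed at write time."""
    return block, zlib.crc32(block)

def verify(block: bytes, stored_crc: int) -> bool:
    """Detect-only check: recompute the checksum and compare. Nothing is written."""
    return zlib.crc32(block) == stored_crc

block, crc = store(b"important data")
ok = verify(block, crc)                          # unchanged block passes

# Simulate a single flipped bit somewhere between write and read.
flipped = bytes([block[0] ^ 0x01]) + block[1:]
detected = not verify(flipped, crc)              # the mismatch is reported
```

In this detect-only form, a faulty check can at worst raise a false alarm or miss a real error. The stored bytes stay as they were.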

But they are, though. In that first example, corrupted parity calculations caused data corruption during a scheduled array check while the system was under unusually heavy load. Error checking is good, and when things are working right it can only help. That is true, but it can't always be counted on if the hardware, software, etc. is untrustworthy for whatever reason.
Okay, well I am totally and utterly confused as to how that could ever be possible, regardless of the hardware. You're confident that if not for the data validation the problem wouldn't have occurred?
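
One way the two positions might be reconciled is the difference between a check that only reports a mismatch and a check that also "repairs" it. The sketch below is a toy XOR-parity model, not the behavior of any particular controller. `flaky_xor_blocks` is a hypothetical stand-in for a parity calculation running on bad memory, and the `repair` flag is an assumption about how an array check could be configured.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks (RAID-5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def flaky_xor_blocks(blocks):
    """Same calculation, but one bit of the result is flipped to mimic bad memory."""
    result = bytearray(xor_blocks(blocks))
    result[0] ^= 0x01
    return bytes(result)

def scrub(data_blocks, stored_parity, compute_parity, repair):
    """Recompute parity and compare it with the stored copy.

    Returns (parity_left_on_disk, mismatch_seen). With repair=True the
    recomputed value replaces the stored one whenever they differ.
    """
    computed = compute_parity(data_blocks)
    if computed == stored_parity:
        return stored_parity, False
    return (computed if repair else stored_parity), True

def rebuild(surviving_blocks, parity):
    """Reconstruct one missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)              # correct parity, written while memory was fine

# Detect-only scrub on bad memory: a false alarm, but the disk is untouched.
kept, mismatch = scrub(data, parity, flaky_xor_blocks, repair=False)
rebuilt_ok = rebuild([data[0], data[2]], kept)

# Repairing scrub on bad memory: good parity is overwritten with a bad value.
overwritten, _ = scrub(data, parity, flaky_xor_blocks, repair=True)
rebuilt_bad = rebuild([data[0], data[2]], overwritten)
```

Under these assumptions `rebuilt_ok` equals the lost block `data[1]`, while `rebuilt_bad` does not: the damage comes from the write-back step acting on a bad calculation, which only shows up later when a disk actually has to be rebuilt.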