by GrayShade 699 days ago
Yeah, in the last year and a half, I've had three btrfs file systems crash on me with the dreaded "parent transid verify failed". Twice it was out of the blue; the third time was just after it filled up.

The people on IRC tend to default to "unless you're using an enterprise drive, it's probably buggy and doesn't respect write barriers", which shouldn't have mattered because there was no system crash involved.

Yes, I did test my RAM, I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.

2 comments

> Yes, I did test my RAM, I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.

Just luck. Software can't defend itself against bad RAM. There's always the possibility that bad RAM will cause ZFS to corrupt itself in some way it can't recover from.

Everything is in RAM. The kernel, the ZFS code, everything. All of that is vulnerable to corruption. No matter how fancy ZFS is, it can't stop its own code from being corrupted. It's just luck that it didn't happen.

Well, yes and no. The amount of RAM consumed by the filesystem driver is negligible compared to the truckloads of filesystem data shoveled through it. If we assume that errors are comparatively rare, the code itself is unlikely to be affected. Even if you're unlucky enough to get RAM corruption in the 0.01% occupied by the ZFS driver, the chance that a bit will flip in just such a way as to make a checksum succeed when it should have failed due to a second bit flip is virtually nonexistent. It's much more likely that it simply crashes in some way. As such, ZFS is much more resilient to on-disk filesystem corruption from bad RAM than systems which don't do any checksumming at all.
ECC RAM helps
> For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.

Be careful though. If whatever data was to be written got corrupted early enough, i.e. before ZFS got to see it, it happily wrote corrupted data to disk with a matching checksum and you're none the wiser. But yes, it didn't blow up the entire filesystem the way btrfs likes to do.
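
To make the distinction concrete, here is a rough toy sketch of the two cases discussed in this thread. It is not how ZFS is implemented: `write_block`, `read_ok`, and `flip_bit` are hypothetical helpers made up for illustration, and SHA-256 just stands in for whatever checksum the filesystem actually uses. It only shows that a bit flip after the checksum is computed gets caught on read, while a bit flip before the filesystem ever sees the data does not.

```python
import hashlib


def checksum(block: bytes) -> str:
    # Stand-in checksum (assumption): SHA-256 over the block contents.
    return hashlib.sha256(block).hexdigest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    # Simulate a single bad-RAM bit flip at the given bit position.
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def write_block(block: bytes):
    # Hypothetical "filesystem write": store the data next to its checksum.
    return block, checksum(block)


def read_ok(stored: bytes, stored_sum: str) -> bool:
    # Hypothetical "filesystem read": recompute and compare the checksum.
    return checksum(stored) == stored_sum


original = b"important file contents"

# Case 1: the bit flips AFTER the checksum was computed.
data, digest = write_block(original)
damaged = flip_bit(data, 3)
detected_late = not read_ok(damaged, digest)

# Case 2: the bit flips BEFORE the filesystem ever sees the data.
pre_damaged = flip_bit(original, 3)
data2, digest2 = write_block(pre_damaged)
detected_early = not read_ok(data2, digest2)

print("flip after checksum detected: ", detected_late)
print("flip before checksum detected:", detected_early)
```

In the second case the stored checksum matches the already-corrupted data, so the read verifies cleanly even though `data2` differs from `original`.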