by scaladev 2030 days ago
>and 2 on an openSUSE Tumbleweed system using BTRFS as root.

How long ago was that? And have you been using other fully checksummed filesystems (like ZFS) on that hardware since then? I'm asking because when btrfs has been used without any RAID features (or with simple RAID modes like 1/0) in the past several years and it breaks, digging deep enough into the problem often shows the hardware to be at fault.

And ext4 or xfs either don't detect the corruption at all (if it's data corruption), or have better error recovery if the FS's own metadata got trashed (which is a strong argument in favor of them, I agree, but I wouldn't trust a filesystem that had been through that anyway and would restore from backups right away).

Edit: it's a strong argument for using them to store data that is already checksummed by some higher component in your software stack, like the database. Otherwise, you're just asking for silent bitrot.
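
To make the "checksummed by a higher component" idea concrete, here's a rough sketch in Python of an application verifying its own records on read. The names `seal`, `unseal` and `ChecksumError` are made up for illustration and aren't any particular database's API; real databases use their own page or record checksum formats.

```python
import hashlib

DIGEST_SIZE = 32  # length of a SHA-256 digest in bytes


class ChecksumError(Exception):
    """Raised when stored data no longer matches its recorded digest."""


def seal(payload: bytes) -> bytes:
    """Return the payload prefixed with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def unseal(record: bytes) -> bytes:
    """Verify the digest and return the payload, or raise ChecksumError."""
    digest, payload = record[:DIGEST_SIZE], record[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise ChecksumError("payload does not match its stored SHA-256 digest")
    return payload


if __name__ == "__main__":
    record = seal(b"important row")
    print(unseal(record))  # intact record round-trips

    # Simulate a single flipped bit, as silent corruption on disk might cause.
    damaged = bytearray(record)
    damaged[-1] ^= 0x01
    try:
        unseal(bytes(damaged))
    except ChecksumError as exc:
        print("detected:", exc)
```

This only detects the damage; without a second copy or parity there is nothing to repair from, so the fix is still a restore from backup.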

3 comments

> if you dig deep enough into the problem, often the hardware is found to be at fault

That's not really good enough though. Next gen file systems are supposed to be resilient even if hardware fails. That's the whole point of RAID and checksumming. ZFS was very much intended to be resilient when faced with bad hardware. Heck, even in the 90s this was a known problem, hence chkdsk on DOS marking bad sectors to somewhat mitigate data corruption on FAT file systems. If Btrfs only works when hardware is behaving then that is absolutely a problem with Btrfs.

As for my experience with ZFS, it's kept consistency when disks have died. It's worked flawlessly when SATA controllers have died (one motherboard would randomly drop HDDs when the controllers experienced high IOPS -- which would be enough to trash any normal file system but ZFS survived it with literally no data loss). Not to mention frequent unscheduled power cuts, kernel panics (unrelated to ZFS), and so on and so forth. I'm sure it's possible to trash a ZFS volume but it's stood strong on some pretty dubious hardware configurations for me and where most other file systems would have failed.

Letsee, this root filesystem says it was installed on 2019-05-11, so what, a year and a half ago? ish? I just wiped it and reinstalled since only the root filesystem was hosed (separate home filesystem thankfully wasn't affected) and this box was already fully managed by ansible so I just rebuilt an exact replica of the same system in place. (In hindsight, no, I don't know why I didn't use that opportunity to switch to XFS.)

Also, I'm going to somewhat mirror sibling comments: Even if the hardware is faulty, that should produce a filesystem with explicit checksum errors, not an unreadable filesystem. There is certainly an upper limit to what it could catch, but you'll have to forgive my skepticism that only one of the 2 filesystems on the system was affected and only after months of use, and then the corruption was so complete that it couldn't even tell me what was wrong and try to fix it.

> if you dig deep enough into the problem, often the hardware is found to be at fault

Well, with ZFS I've had hardware break and still not experienced any data loss. I've had cables come loose multiple times, I've had several disks die[1], I've had unstable SATA controllers (hello JMicron) and plenty of unexpected power losses and hard resets.

Yet ZFS has sailed through it all with my data intact. Sure ZFS ain't bulletproof. It can get messed up. But for the most part it takes a lot of beating without a dent.

[1]: As a matter of fact, I just finished resilvering a RAID-Z1 pool in my NAS after a WD Red 3TB died after almost 7 years of 24/7 operation (barring a few accidental power outages).