Hacker News
by eptcyka 699 days ago
In 2019, btrfs ate all my data after a power cut. Btrfs peeps said it sounded like my SSD was at fault. Well, ZFS is still chugging along on that drive. I am not surprised btrfs took ages to stabilize, and it will take ages again before I rely on it. I’ve had previous btrfs incidents too. I think the argument against btrfs is that, for ages, btrfs devs told people to use it in production when it was not good enough.
3 comments

Anecdotally and absolutely not production experience here, but I've had a Synology device running btrfs for 7 or 8 years now. Only issue I ever had is when I shipped it cross country with the drives in it, but was able to recover just fine.

This includes plenty of random power losses.

They do use btrfs. However, Synology also uses some additional tools on top of btrfs. From what I remember (could be wrong about the precise details), they actually run mdadm on top of btrfs, and use mdadm in order to get the erasure coding and possibly the NVMe cache disk too. (By erasure coding, I mean RAID 5/6, or SHR, which are still generally unstable in btrfs.)
I assume you mean running btrfs on top of md (mdadm) or dm (dmraid), not the other way around?
Whoops, you are correct! And it looks like it is dmraid, not mdadm.

https://daltondur.st/syno_btrfs_1/

Sorry about that!

Yeah, in the last year and a half, I've had three btrfs file systems crash on me with the dreaded "parent transid verify failed". Two times it was out of the blue, third time was just after it filled up.

The people on IRC tend to default to "unless you're using an enterprise drive, it's probably buggy and doesn't respect write barriers", which shouldn't have mattered because there was no system crash involved.

Yes, I did test my RAM, I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.

> Yes, I did test my RAM, I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.

Just luck. Software can't defend itself against bad RAM. There's always the possibility that bad RAM will cause ZFS to corrupt itself in some way it can't recover from.

Everything is in RAM. The kernel, the ZFS code, everything. All of that is vulnerable to corruption. No matter how fancy ZFS is, it can't stop its own code from being corrupted. It's just luck that it didn't happen.

Well, yes and no. The amount of RAM consumed by the filesystem driver is negligible compared to the truckloads of filesystem data shoveled through it. If we assume that errors are comparatively rare, the code itself is unlikely to be affected. Even if you're unlucky enough to get RAM corruption in the 0.01% occupied by the ZFS driver, the chance that a bit will flip in just such a way as to make a checksum succeed when it should have failed due to a second bit flip is virtually nonexistent. Much more likely that it simply crashes in some way. As such ZFS is much more resilient to on-disk filesystem corruption from bad RAM than systems which don't do any checksumming at all.
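
To make the checksum argument concrete, here is a minimal sketch of why a stray bit flip in the data is far more likely to be caught than to slip through. It assumes SHA-256 as a stand-in checksum (not necessarily what any given pool is configured with), and `checksum` and `flip_bit` are hypothetical helper names, not filesystem APIs.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # SHA-256 is only a stand-in here; the real checksum depends on pool settings.
    return hashlib.sha256(block).digest()

def flip_bit(block: bytes, bit_index: int) -> bytes:
    # Return a copy of the block with exactly one bit inverted.
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

block = b"filesystem data" * 100
stored_sum = checksum(block)        # computed while the data was still good

damaged = flip_bit(block, 1234)     # a bit flips somewhere after checksumming
detected = checksum(damaged) != stored_sum
print("corruption detected:", detected)
```

For the flip to go unnoticed, a second error would have to change the stored checksum or the comparison in exactly the matching way, which is the "virtually nonexistent" case described above.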
ECC RAM helps
> For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.

Be careful though. If whatever data was to be written got corrupted early enough, i.e. before ZFS got to see it, it happily wrote corrupted data to disk with a matching checksum and you're none the wiser. But yes, it didn't blow up the entire filesystem the way btrfs likes to do.
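
A rough sketch of that failure mode, assuming a toy write path that checksums whatever bytes it is handed. `write_block` and `read_block` are hypothetical names for illustration, and CRC32 is just a convenient stand-in checksum.

```python
import zlib

def write_block(data: bytes) -> tuple[bytes, int]:
    # Hypothetical write path: checksum whatever bytes arrive, then store both.
    return data, zlib.crc32(data)

def read_block(stored: tuple[bytes, int]) -> bytes:
    # Verify the stored checksum before handing the data back.
    data, saved_sum = stored
    if zlib.crc32(data) != saved_sum:
        raise IOError("checksum mismatch")
    return data

original = b"important document"
# Bad RAM flips a bit in the application's buffer before the filesystem sees it.
corrupted_in_ram = bytes([original[0] ^ 0x01]) + original[1:]

stored = write_block(corrupted_in_ram)
read_back = read_block(stored)   # no error: the checksum matches the corrupted bytes
print(read_back == original)     # False
```

The checksum only vouches for the bytes the filesystem was given, not for what the application meant to write.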

Btrfs never actually stabilized; it's still garbage compared to ZFS.
Care to substantiate that statement? It seems rather arbitrary to just say that it's garbage when it is running and has been running successfully for the vast majority of its users. It also offers two features that ZFS does not: the ability to grow a pool, and offline deduplication.
Based on the reports of corruption and data loss from actual users, I don’t think this claim is true at all.
Does it even have RAID5?
Why should it matter? It's an extremely niche technology that's only interesting to some home users. I see no reasons why other users should care about a RAID level they're not interested in.

(I don't use btrfs or any other COW filesystem because of significantly worse performance with some kinds of workloads, but it has nothing to do with maturity of any of them.)

> Why should it matter? It's an extremely niche technology that's only interesting to some home users.

I use RAID-Z2 in lots of places for bulk storage purposes (HPC).

There's a reason why Ceph added erasure coding:

* https://ceph.io/en/news/blog/2017/new-luminous-erasure-codin...

* https://docs.ceph.com/en/latest/rados/operations/erasure-cod...

When you're talking about PB of data, storage efficiencies add up.
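
As a back-of-the-envelope illustration of how those efficiencies add up, here is a small sketch. The layouts are illustrative assumptions (a 4+2 erasure-coding profile, a 10-disk RAID-Z2 vdev), `raw_needed` is a hypothetical helper, and real-world overhead from padding and metadata is ignored.

```python
def raw_needed(usable_pb: float, data_units: int, redundancy_units: int) -> float:
    # Raw capacity required to hold `usable_pb` when every `data_units` of data
    # are stored alongside `redundancy_units` of parity or extra copies.
    return usable_pb * (data_units + redundancy_units) / data_units

usable = 1.0  # PB of actual data

layouts = {
    "3x replication": (1, 2),       # one copy of the data plus two more copies
    "erasure coding 4+2": (4, 2),   # illustrative k=4, m=2 profile
    "RAID-Z2, 10 disks": (8, 2),    # 8 data + 2 parity, ignoring padding/metadata
}

for name, (k, m) in layouts.items():
    print(f"{name}: {raw_needed(usable, k, m):.2f} PB raw")
```

Under these assumptions, 1 PB of data needs 3 PB of raw disk with triple replication but only 1.5 PB or 1.25 PB with the parity-based layouts.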

Wtf? This is a bizarre take. Facebook poured millions of dollars of R&D into btrfs.
But they likely put money into the features they are interested in, and not RAID 5/6.
Yes, but you can lose data if you are writing to your array when the power goes out. RAIDZ (ZFS) does not have this problem. See the btrfs RAID5 write hole.
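
For anyone unfamiliar with the write hole, here is a toy sketch of the general RAID 5 mechanism, not of btrfs internals. It assumes a three-disk stripe with simple XOR parity; `xor_blocks` is a hypothetical helper. Note the damage only shows up if the stale parity is later used for a rebuild before anything repairs it.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    # Byte-wise XOR of equally sized blocks; this is how RAID 5 parity is formed.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across three toy "disks": two data blocks plus parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Power is lost mid-update: the new d0 reaches its disk, the new parity does not.
d0 = b"CCCC"
# `parity` is now stale: it still describes the old d0.

# Later the disk holding d1 dies and d1 is rebuilt from d0 and the stale parity.
rebuilt_d1 = xor_blocks(d0, parity)
print(rebuilt_d1 == b"BBBB")  # False: d1 was never rewritten, yet it comes back wrong
```

The nasty part is that the block that gets corrupted is one that was not even being written at the time of the power loss.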