| HN Mirror


by marcan_42 1885 days ago

The problem is they weren't just "initial issues". Btrfs was ~8 years old when I started using it as a backend for a production Ceph cluster, on 48 disks (no fancy btrfs redundancy features used, just plain single-disk filesystems with snapshots). I encountered at least two critical issues: filesystems that die on hard power downs (simply issuing a hard shutdown of all machines ended up with two filesystems corrupted beyond repair after booting again), and a snapshot reference leak issue that meant data for deleted snapshots was never freed until a reboot - and even more scarily, fsck reported tons of unfixable errors on those filesystems, although they magically went away after a clean reboot.

Btrfs has not done a good job of inspiring any confidence, many years into development. Thankfully, Ceph has moved on from FS-backed storage to its own implementation on top of raw block devices, and I no longer have Btrfs anywhere in production.

Mind you, I don't trust ZFS either; it does seem to be more stable than Btrfs, but it still suffers from the fundamental issue that all of these "fancy" filesystems do: the fsck/repair tools are never up to par, and there is next to no chance of disaster recovery (with the added drawback that ZFS is not in-tree).

My first experience with one of these "if anything fails, all your data is gone" filesystems was ReiserFS many years ago - 8 bad sectors on a disk killed my home directory and all my data was gone. Since then, I've had rather complex accidents with ext4 and XFS* where I could do manual and automated surgery and recover ~100% of my data. Btrfs and ZFS are in the same class as ReiserFS here. The repair tools just aren't there. Sure, they handle redundancy at the device level like a fancy RAID for "well-behaved" failures like devices just disappearing, but hit anything outside of their model, or anything that tickles a bug, and you can well kiss your data goodbye.

Just to give an example: I once recovered an XFS filesystem that was built on top of a RAID6 array which, due to an unfortunate sequence of events, had one drive too many drop out during a replacement. I ended up manually stitching together an array where one drive had out-of-date data (i.e. one block out of every N came from an earlier point in time than the others). Fsck fixed everything, high-level checksums took care of the few files that were being written to and had become corrupted, and I lost nothing of value. On a good filesystem, fsck does its best to recover all existing data and guarantee the result is consistent.

Yes, I know, backups. I have backups. That's not a reason to neglect repair tools. Backups are one layer of defense that can also fail; they are no excuse to neglect FS-level robustness. For example, my off-site backups are bottlenecked on my 1G internet connection, which means that if I have a weird but largely recoverable soft failure, it is much more efficient to rsync data back from the backup, using checksums to avoid data transfer, rather than copy everything again.
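To illustrate the "use checksums to avoid data transfer" idea, here is a minimal sketch that hashes two directory trees and lists only the files that would need to come back from the backup. It is not how rsync works internally (rsync has its own delta-transfer algorithm, and its `--checksum` option changes how it decides which files differ); the function names `digest_tree` and `plan_restore` are made up for this example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def digest_tree(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def plan_restore(local: Dict[str, str], backup: Dict[str, str]) -> List[str]:
    """Return the backup paths whose content is missing or different locally.

    Files whose digests already match are skipped, so only damaged or
    lost files have to cross the slow link.
    """
    return sorted(
        path for path, digest in backup.items() if local.get(path) != digest
    )
```

In a largely recoverable soft failure most digests match, so the restore list stays short compared to copying everything again.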

And this is why I use CephFS as my "smart" single-host storage solution these days. It has overhead, but it works well, is much more introspectable than ZFS/Btrfs (you can dig through the stack layers very easily if you understand how it works), and I trust its ability to recover from weird failures and device states much more than any RAID solution or fancy multi-device filesystem. It is extremely well engineered.

* I don't recommend XFS either due to kernel implementation performance issues around allocations and such; it was the cause of massive latency issues on my home server for years until I discovered its antics. But at least I've never lost data to XFS. So yeah, just use ext4 if you need a normal filesystem.

1 comments

josephg 1885 days ago

It would be interesting to take each of these filesystems in a simulated environment and zero out stripes of data to see what it would take to kill the disk. All of these fancy filesystems are supposed to have redundancy and error detection in their core structures. But I wonder how well that’s tested - if you simulate single block read failures, are there any blocks that would totally corrupt btrfs or zfs? How about adjacent block pairs?

Seems like a pretty easy test to run and if it found problems, they’d be well worth fixing. (And you could do the test itself pretty efficiently on a ramdisk).
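A rough sketch of the kind of sweep I mean, run against a toy in-memory "image" rather than a real filesystem. Everything here is an assumption for illustration: the layout (two superblock copies plus a single-copy root block), the 512-byte block size, and the `mountable` check are invented and say nothing about how btrfs or ZFS lay out their metadata. For a real test you would swap `mountable` for something that mounts or fscks an image on a ramdisk.

```python
import zlib
from typing import Callable, List

BLOCK = 512       # assumed block size for the toy image
MAGIC = b"TOY1"   # hypothetical marker so an all-zero block is never valid


def sealed(payload: bytes) -> bytes:
    """Build one block: 4-byte CRC32, then MAGIC + payload padded with zeros."""
    body = (MAGIC + payload).ljust(BLOCK - 4, b"\0")
    return zlib.crc32(body).to_bytes(4, "big") + body


def valid(block: bytes) -> bool:
    """A block is valid if it carries MAGIC and its stored CRC32 matches."""
    return (block[4:8] == MAGIC
            and int.from_bytes(block[:4], "big") == zlib.crc32(block[4:]))


def make_image(n_data: int = 4) -> bytes:
    """Blocks 0 and 1: superblock copies. Block 2: root. Rest: data."""
    sb = sealed(b"superblock")
    data = b"".join(sealed(b"data%d" % i) for i in range(n_data))
    return sb + sb + sealed(b"root") + data


def mountable(image: bytes) -> bool:
    """Toy check: need at least one good superblock copy and a good root."""
    blocks = [image[i:i + BLOCK] for i in range(0, len(image), BLOCK)]
    return (valid(blocks[0]) or valid(blocks[1])) and valid(blocks[2])


def fatal_runs(image: bytes, check: Callable[[bytes], bool],
               run_len: int) -> List[int]:
    """Zero every run of run_len adjacent blocks in turn; return the
    starting block index of each run that makes check() fail."""
    n = len(image) // BLOCK
    fatal = []
    for start in range(n - run_len + 1):
        buf = bytearray(image)
        buf[start * BLOCK:(start + run_len) * BLOCK] = bytes(run_len * BLOCK)
        if not check(bytes(buf)):
            fatal.append(start)
    return fatal


img = make_image()
print(fatal_runs(img, mountable, 1))  # single-block failures
print(fatal_runs(img, mountable, 2))  # adjacent block pairs
```

In this toy layout the single-copy root block is the obvious weak spot, and zeroing both superblock copies at once is fatal too. The interesting question is whether the real filesystems have any blocks like that.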


kasabali 1885 days ago

You mean something like this?

https://www.unixsheikh.com/articles/battle-testing-data-inte...


magicalhippo 1884 days ago

The ZFS project has a test suite[1], and as far as I can determine, nuking data is part of it. See for example the raidz_test tool[2], which seems to do what you suggest.

[1]: https://github.com/openzfs/zfs/tree/master/tests/zfs-tests

[2]: https://github.com/openzfs/zfs/tree/master/cmd/raidz_test (run_rec_check_impl etc)
