by Woung1938 1885 days ago
Btrfs is actually in use by some big companies like Facebook but the initial issues seem to linger in people's memory and thus everyone and their cat avoids btrfs like fire. It reminds me of systemd for some reason.

For the record I'm using btrfs on Arch (so recent kernel) for years with no issues (including LUKS encrypted root filesystem and RAID1 arrays for backups).


> initial issues

You mean like the current advice not to use anything except mirroring and striping (RAID-0/1/10)?

> Parity may be inconsistent after a crash (the "write hole"). The problem born when after "an unclean shutdown" a disk failure happens. But these are two distinct failures. These together break the BTRFS raid5 redundancy. If you run a scrub process after "an unclean shutdown" (with no disk failure in between) those data which match their checksum can still be read out while the mismatched data are lost forever.

* https://btrfs.wiki.kernel.org/index.php/RAID56
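
The failure sequence in that quote can be sketched with plain XOR parity. This is a toy model, not btrfs code: the two-data-block stripe, the block contents and the `xor_blocks` helper are all made up for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A toy stripe: two data blocks plus one parity block (hypothetical contents).
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Normal case: the disk holding d1 dies, and d1 is rebuilt from d0 and parity.
assert xor_blocks(d0, parity) == d1

# Write hole: d0 is overwritten, but the machine crashes before the matching
# parity update reaches disk, so parity still describes the old d0.
d0_new = b"CCCC"
stale_parity = parity

# If the disk holding d1 fails now, the rebuild yields garbage.
rebuilt_d1 = xor_blocks(d0_new, stale_parity)
print(rebuilt_d1 == d1)  # False
```

In this model, recomputing parity as `xor_blocks(d0_new, d1)` while both data blocks are still readable closes the gap, which is why the quote treats the crash and the later disk failure as two distinct failures.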

I've been using ZFS since it came out on Solaris 10 over a decade ago and it was specifically designed not to have a write hole due to its COW/ACID nature.

See this 2008 SNIA presentation from Bonwick and Moore, the creators of ZFS talking about not having a write hole:

* ACID/COW: https://www.youtube.com/watch?v=NRoUC9P1PmA&t=24m

* Integrity: https://www.youtube.com/watch?v=NRoUC9P1PmA&t=55m20s

(N=1) I have a single-disk laptop running opensuse (tumbleweed) on btrfs. It's the only machine I've ever owned to corrupt its root filesystem beyond repair, and it's done so twice IIRC (definitely 2, maybe 3), within the last few years. It's not just the initial issues.
(N=1) I've had ZFS on an opensolaris system and it got corrupted, and since the ZFS engineers think they are gods who don't make mistakes there was no fsck that would even attempt to repair it. It was a perfectly repairable corruption which I fixed myself with a bit of googling and dd to copy some bytes from one location on the disk to another (ZFS apparently keeps multiple copies of some of the important data structures that describe the pool, one at the beginning of the block device and one towards the end, kind of as a backup I guess). For btrfs you at least have a decent working fsck, if shit hits the fan. ZFS is like, fuck you we won't even try.
I want to like ZFS, and am using it, but however good it is at not losing data while in operation, the UI feels like it’s designed to make you wreck your data. I guess I just need more practice, but not being able to just rip a drive out and mount it on another machine in a pinch makes me damn nervous. Something about how it’s managed makes the whole file system feel ephemeral, just one bad-but-not-obviously-so command away from being destroyed, and I’m nowhere near being comfortable with that yet (and don’t really see a path to getting to comfort)
> but not being able to just rip a drive out and mount it on another machine in a pinch makes me damn nervous

Why can't you? Granted, you need enough disks to actually have all the data - so ex. if you did RAID0 then yes you need all disks, but say if you did a mirror you can totally just yank a disk out, attach it to another machine, and `zpool import` it.

Can you? I was under the impression that without an “export” beforehand, you can’t.
See the "split" command with OpenZFS 0.8.0+:

* https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSSplitPoolE...

Only with mirrored drives.

Any RAID-Z level would need a full export/import as data is striped, but hot-swap drives can be pulled once things are unmounted.

I recently did just this. I had to use the -f flag but it imported just fine on a different computer.

I agree that it can be a bit daunting to operate. There are a few footguns around that might not lead to data loss but can lead to unfortunate situations.

Just the other day someone on the mailing list had managed to add a single drive as a new top-level vdev to a petabyte pool, rather than adding it as a new spare drive, simply by omitting the word "spare" from the "zpool add" command...

That said, I've been using ZFS at home here with 6+ disks for almost a decade now, and I've never lost data despite lots of various incidents, including lots of power losses and various hardware failures (like disks, mobo and PSU). So overall I'm very happy with it.

Oh no that'd be a terrible design :) AFAIK the biggest difficulty is that you might have to use `zpool import -f` to force it to ignore the pool not having been cleanly exported.

EDIT: It'd look like this: https://serverfault.com/questions/964075/how-can-i-recover-m...

I've had btrfsck segfault on me a couple of times
scrub didn't help? I thought scrub was like fsck for ZFS.
> For btrfs you at least have a decent working fsck

When I say "corrupted beyond repair", I mean "the btrfs tools were not actually helpful".

Another (n=1) anecdote: similarly for me, ~4 years ago my raid-1 workstation OS drive which was, at the time, using btrfs nuked itself without warning or repair. Trying to recover any data was likewise an exercise in rapidly learning about FS internals.

I use zfs on everything now. I am sure at some point it will die horribly, but for now I haven't had a single problem in ~60 managed drives across 3 machines.

The other way to interpret this would be "it was the only filesystem to detect corruption on my malfunctioning hardware", because that's what usually happens in the last few years.

Or have you been using ZFS on the same hardware?

I haven't used ZFS on the same hardware, but during the time when BTRFS ate itself multiple times my home partition on the same drive also on BTRFS was perfectly fine, so it'd have to be an awfully specific hardware failure. Also, it would have had to hit the metadata both times since we're talking "pool wouldn't import" not "it gave me data checksum errors". Which again, is possible, but on an SSD with wear-leveling seems a tad unlikely.
He said "corrupt its root filesystem beyond repair", not "detect checksum errors"

Btw raid5/6 is still broken on btrfs which makes it a hard sell for any system with more than 2 disks. cf. raidz on ZFS

So? Do you think filesystem metadata is stored in a magical pixie cloud, or on the same unreliable physical hardware where it can easily get corrupted, especially after a crash or an unexpected power loss?

I posted this link here already:

https://www.usenix.org/conference/atc19/presentation/jaffer

f2fs (at least in its state a couple of years ago) is/was a prime example of how a filesystem can get into a barely working state with massive amounts of data and metadata corruption, and not even notice it.

God I love this site. In case of a minor disagreement with someone don't even bother to think, just press "downvote".

Corrupt data should be corrected by checksummed btrfs, shouldn't it?
Only if you have more than one copy of the data
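
A minimal sketch of why a checksum alone only detects corruption, while a second copy allows repair. It is an assumption-laden toy: `read_block` is a hypothetical helper, and `zlib.crc32` stands in for whatever checksum the filesystem actually uses.

```python
import zlib


def read_block(copies: list[bytes], expected_crc: int) -> bytes:
    """Return the first copy whose CRC32 matches the stored checksum."""
    for data in copies:
        if zlib.crc32(data) == expected_crc:
            return data
    raise IOError("checksum mismatch on every copy: detected, not repairable")


good = b"important data"
crc = zlib.crc32(good)
bad = b"importent data"  # simulated corruption of one byte

# Two copies (e.g. a mirror): the bad copy is skipped, the good one returned.
print(read_block([bad, good], crc))

# One copy: the corruption is detected but there is nothing to repair from.
try:
    read_block([bad], crc)
except IOError as err:
    print(err)
```
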
RAID 5 has been generally advised against for years, due to performance issues and the effect of unrecoverable errors during rebuilds.

Btrfs RAID1 works perfectly, and RAID1c3/RAID1c4 provides additional redundancy. In place of RAID5, use RAID10 instead.

raidz2 is only advised against if your arrays are so small that you don't care about the price of storage.

If you want more IOPS, add more raidz2 (raid6) stripes to the pool. In practice, spinning rust is the new tape. Trying to do random access under 1MB is just silly

I don't stress over rebuilds. 2 more disks failing during a rebuild is incredibly unlikely compared to everything else that might force me to restore a backup (software bugs, data center flooding, etc).
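
For a rough sense of scale, here is a back-of-the-envelope sketch of that claim. Every number in it (failure rate, rebuild time, vdev width) is an assumption picked for illustration, and it treats disk failures as independent, which is optimistic for drives from the same batch. It also ignores unrecoverable read errors.

```python
from math import comb

# All numbers below are illustrative assumptions, not measurements.
ANNUAL_FAILURE_RATE = 0.02   # assumed 2% chance a disk fails in a year
REBUILD_HOURS = 24           # assumed time to rebuild one disk
REMAINING_DISKS = 9          # e.g. a 10-disk raidz2 vdev with one disk already dead

# Chance that one given disk fails during the rebuild window.
p = ANNUAL_FAILURE_RATE * REBUILD_HOURS / (365 * 24)


def at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent disks fail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_one_more = at_least(1, REMAINING_DISKS, p)
p_two_more = at_least(2, REMAINING_DISKS, p)
print(f"one or more further failures: {p_one_more:.2e}")
print(f"two or more further failures: {p_two_more:.2e}")
```

Changing the constants shows how quickly the picture shifts with longer rebuilds or worse disks.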

I'm running opensuse tumbleweed past couple years on my laptop & desktop, with btrfs as root FS. No issues. I also run btrfsmaintenance script every month, maybe that helps.
SLES is a recommended platform for SAP, and that has had btrfs as the default filesystem since version 12 in 2014.

I'm using btrfs on several systems, laptop, desktop and server, on various configurations of disks.

It has served me well for years, on the server it helped me detect a bad SATA controller. It would work perfectly in light usage, but start introducing errors in heavy usage, which made one disk inconsistent with the others in the storage pool.

Btrfs alerted me to this and after moving the disk to a good controller, I ran btrfs-check --repair on the unmounted disk (after reading the warnings), which got the FS back to a consistent state, remounted the whole pool and ran a btrfs scrub to get everything back in line with itself. The whole process did take a while, but I had backups and wanted to try out the tools. In the end there was no data loss, and the pool is still running perfectly today.

>SLES is a recommended platform for SAP, and that has had btrfs as the default filesystem since version 12 in 2014.

And they specifically tell you to use XFS for any production deployments.

Another anecdote - on working Xeon E3 hardware (so ECC, etc) I have had btrfs corrupt itself as recently as 5.x kernels from just normal use with compression on a single root device. ext4, xfs and zfs work flawlessly on the same hardware.

Furthermore - btrfs feels excessively complicated for simple workflows - if I want to snapshot a btrfs volume without exposing the snapshot to the machine’s view of the file system, I have to do a bunch of volume layout setup first. With ZFS I can just snapshot.

The problems weren't 'initial'. I had to abandon a BTRFS partition that could only mount read-only just two years ago. It wasn't all that long ago that I had two separate installs experience ridiculously bad performance issues just because they had rolling hourly snapshots happening in the background for a few months, but I suppose I could be convinced to restart that test...
The problem is they weren't just "initial issues". Btrfs was ~8 years old when I started using it as a backend for a production Ceph cluster, on 48 disks (no fancy btrfs redundancy features used, just plain single-disk filesystems with snapshots). I encountered at least two critical issues: filesystems that die on hard power downs (simply issuing a hard shutdown of all machines ended up with two filesystems corrupted beyond repair after booting again), and a snapshot reference leak issue that meant data for deleted snapshots was never freed until a reboot - and even more scarily, fsck reported tons of unfixable errors on those filesystems, although they magically went away after a clean reboot.

Btrfs has not done a good job of inspiring any confidence, many years into development. Thankfully, Ceph has moved on from FS-backed storage to its own implementation on top of raw block devices, and I no longer have Btrfs anywhere in production.

Mind you, I don't trust ZFS either; it does seem to be more stable than Btrfs, but it still suffers from the fundamental issue that all of these "fancy" filesystems do: the fsck/repair tools are never up to par, and there is next to no chance of disaster recovery (with the added drawback that ZFS is not in-tree).

My first experience with one of these "if anything fails, all your data is gone" filesystems was ReiserFS many years ago - 8 bad sectors on a disk killed my home directory and all my data was gone. Since then, I've had rather complex accidents with ext4 and XFS* where I could do manual and automated surgery and recover ~100% of my data. Btrfs and ZFS are in the same class as ReiserFS here. The repair tools just aren't there. Sure, they handle redundancy at the device level like a fancy RAID for "well-behaved" failures like devices just disappearing, but anything outside of their model, or that tickles a bug, and you can kiss your data goodbye.

Just to give an example: I once recovered an XFS filesystem that was built on top of a RAID6 array which, due to an unfortunate sequence of events, had one drive too many drop out during a replacement, which resulted in me manually stitching together an array where one drive had out-of-date data (i.e. every block out of N was from an earlier point-in-time than the others). Fsck fixed everything, high-level checksums took care of the few files that were being written to and had become corrupted, and I lost nothing of value. On a good filesystem, fsck does its best to recover all existing data and guarantee the result is consistent.

Yes, I know, backups. I have backups. That's not a reason to neglect repair tools. Backups are one layer of defense that can also fail; they are no excuse to neglect FS-level robustness. For example, my off-site backups are bottlenecked on my 1G internet connection, which means that if I have a weird but largely recoverable soft failure, it is much more efficient to rsync data back from the backup, using checksums to avoid data transfer, rather than copy everything again.
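
The idea of using checksums to avoid re-transferring good data can be sketched like this. It is a simplified fixed-block comparison, not rsync's actual rolling-checksum algorithm, and the helper names and tiny block size are made up for the example.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def block_hashes(data: bytes) -> list[str]:
    """Hash each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def blocks_to_fetch(local: bytes, remote_hashes: list[str]) -> list[int]:
    """Indices of blocks whose local hash differs from the backup's hash."""
    local_hashes = block_hashes(local)
    return [
        i for i, remote in enumerate(remote_hashes)
        if i >= len(local_hashes) or local_hashes[i] != remote
    ]


backup = b"aaaabbbbccccdddd"    # known-good copy on the backup host
damaged = b"aaaaXXXXccccdddd"   # local copy after a soft failure

# Only the hashes cross the network first; then only the mismatched blocks.
print(blocks_to_fetch(damaged, block_hashes(backup)))  # [1]
```

With a mostly intact local copy, the transfer is limited to the hash list plus the damaged blocks, which is the saving the comment is after on a bandwidth-limited link.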

And this is why I use CephFS as my "smart" single-host storage solution these days. It has overhead, but it works well, is much more introspectable than ZFS/Btrfs (you can dig through the stack layers very easily if you understand how it works), and I trust its ability to recover from weird failures and device states much more than any RAID solution or fancy multi-device filesystem. It is extremely well engineered.

* I don't recommend XFS either due to kernel implementation performance issues around allocations and such; it was the cause of massive latency issues on my home server for years until I discovered its antics. But at least I've never lost data to XFS. So yeah, just use ext4 if you need a normal filesystem.

It would be interesting to take each of these filesystems in a simulated environment and zero out stripes of data to see what it would take to kill the disk. All of these fancy filesystems are supposed to have redundancy and error detection in their core structures. But I wonder how well that’s tested - if you simulate single block read failures, are there any blocks that would totally corrupt btrfs or zfs? How about adjacent block pairs?

Seems like a pretty easy test to run and if it found problems, they’d be well worth fixing. (And you could do the test itself pretty efficiently on a ramdisk).
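
A minimal sketch of what such a fault-injection loop could look like, run against a toy in-memory "disk" rather than a real btrfs or ZFS image. The layout here (a checksummed superblock record with a magic string, stored at a configurable set of block locations) is entirely hypothetical and only serves to show the shape of the test.

```python
import zlib

BLOCK = 16       # toy block size in bytes
NUM_BLOCKS = 8   # toy disk size in blocks
MAGIC = b"SUPERBLOCK!!"  # 12-byte hypothetical magic/payload


def make_disk(super_locations: list[int]) -> bytearray:
    """Build a toy disk with a checksummed superblock at each given block."""
    disk = bytearray(b"\xaa" * (BLOCK * NUM_BLOCKS))
    record = MAGIC + zlib.crc32(MAGIC).to_bytes(4, "big")  # 16 bytes
    for loc in super_locations:
        disk[loc * BLOCK:(loc + 1) * BLOCK] = record
    return disk


def can_mount(disk: bytearray, super_locations: list[int]) -> bool:
    """A mount succeeds if any superblock copy has the magic and a valid CRC."""
    for loc in super_locations:
        record = bytes(disk[loc * BLOCK:(loc + 1) * BLOCK])
        payload, stored = record[:12], int.from_bytes(record[12:], "big")
        if payload == MAGIC and zlib.crc32(payload) == stored:
            return True
    return False


def fatal_faults(super_locations: list[int], width: int) -> list[int]:
    """Zero every run of `width` adjacent blocks; return starts that kill the mount."""
    fatal = []
    for start in range(NUM_BLOCKS - width + 1):
        disk = make_disk(super_locations)
        disk[start * BLOCK:(start + width) * BLOCK] = bytes(width * BLOCK)
        if not can_mount(disk, super_locations):
            fatal.append(start)
    return fatal


# Compare two copies stored side by side with two copies at opposite ends.
for layout in ([0, 1], [0, NUM_BLOCKS - 1]):
    print(layout, "single:", fatal_faults(layout, 1), "pairs:", fatal_faults(layout, 2))
```

In this toy, no single-block fault is fatal for either layout, but the side-by-side layout dies to one adjacent-pair fault while the layout with copies at both ends of the disk survives all of them. A real test would replace `can_mount` with an actual mount or import attempt against a disk image.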

The ZFS project has a test suite[1], and as far as I can determine, nuking data is part of it. See for example the raidz_test tool[2], which seems to do what you suggest.

[1]: https://github.com/openzfs/zfs/tree/master/tests/zfs-tests

[2]: https://github.com/openzfs/zfs/tree/master/cmd/raidz_test (run_rec_check_impl etc)