Hacker News
by cestith 1580 days ago
Personally I've always really liked ReiserFS. The collected storage of small files, the storage of the tails of files, the balanced tree structure, and the fast journaling gave better performance in reads, writes, boot time, and storage efficiency than most file systems. The focus on small files, which were and arguably still are most files on a typical system, was a big key to this. It made it ideal for storing things like configuration files, email in maildir format, email spools, version control repositories, source directories, and many executables. The logical successor was supposed to be btrfs, but that project IMHO may never be ready for production use.

All that said, if it's hurting kernel development and almost nobody is using it, perhaps a deprecation cycle is due. Maybe bcachefs is a good replacement? Or perhaps nobody cares about efficiently storing small files these days at all, and we just all go to XFS, ext4, and ZFS. I think dropping it during a short period is detrimental to users. Maybe a two or three version warning is due.

4 comments

It's great for small files, but it's also got some really nasty little corner cases. You do not store ReiserFS filesystem images on ReiserFS filesystems without encryption, or the next time you run fsck, you'll end up with the two filesystems oddly merged into something entirely useless. I'm not sure if that's been fixed, I got a bit tired of the exotica for general daily driver systems, and ext3/ext4 cover my needs well enough.

I'm not sure btrfs is a sane replacement, though. It's exceedingly complex, and I try to avoid needless complexity where possible.

That's one heck of an edge case! How does fsck manage to completely confuse filesystem metadata and data like this? Surely this is only if there's corruption in the (outer) filesystem, right?
I think it was the --rebuild-tree argument, which as I understand tries to fix an otherwise completely broken filesystem by searching for anything that looks like metadata and gluing it back together.

I've not looked at reiserfs in many years though, so I could be mis-remembering here.

Seems like something pretty fixable too... Only scan areas of the disk which aren't part of already valid filesystem structures, including the user's files.
It's kind of the point of fsck that it scans areas of the disk which aren't part of already valid filesystem structures. It's what you run when your filesystem structure is invalid in order to recover as much as you can.
And then the user deletes that file, and it's not a part of a file anymore, but is still stored somewhere on a disk.
ReiserFS allows metadata anywhere on the disk. That’s how it lets an ext2 filesystem and a ReiserFS filesystem share the same disk during conversion.
No, the metadata is somehow distributed and fsck can pick up pieces of it. I don't think you have to encrypt the image to avoid the edge case; compressing the image should be enough. It still sucks though.
How do other filesystems avoid this? It seems like a hard problem in general.
Many other filesystems have one or more arrays for metadata. Sometimes these are formatted like files themselves. So a tool to recover it only has to guess at those contiguous arrays, instead of considering essentially any block metadata.
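
To make the failure mode above concrete, here is a toy sketch in Python. It is not ReiserFS's real on-disk format or the real `--rebuild-tree` logic: the signature `TOYMETA!`, the 4096-byte block size, and all function names are assumptions made up for illustration. It only shows why a scanner that accepts any block that "looks like metadata" will also pick up metadata from an image stored as an ordinary file, and why compressing that image hides it from the scanner.

```python
import zlib

BLOCK = 4096
SIG = b"TOYMETA!"  # hypothetical signature, NOT the real ReiserFS format


def make_block(payload: bytes) -> bytes:
    """Pad a payload with zero bytes up to one full block."""
    return payload.ljust(BLOCK, b"\x00")


def metadata_block(name: str) -> bytes:
    """A toy metadata block: signature followed by a name."""
    return make_block(SIG + name.encode())


def store_file(data: bytes) -> list:
    """Split file contents into blocks, as an outer filesystem would."""
    return [make_block(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]


def naive_rebuild_scan(disk: list) -> list:
    """Collect every block that starts with the signature, wherever it is."""
    found = []
    for block in disk:
        if block.startswith(SIG):
            found.append(block[len(SIG):].rstrip(b"\x00").decode())
    return found


# An inner filesystem image made of two toy metadata blocks.
inner_image = metadata_block("inner-root") + metadata_block("inner-dir")

# Outer disk holding the raw image as a regular file.
outer_raw = [metadata_block("outer-root")] + store_file(inner_image)

# Outer disk holding a zlib-compressed copy of the same image.
outer_compressed = [metadata_block("outer-root")] + store_file(zlib.compress(inner_image))

print(naive_rebuild_scan(outer_raw))         # inner metadata is picked up too
print(naive_rebuild_scan(outer_compressed))  # only the outer metadata is found
```

A filesystem that keeps its metadata in a few known arrays gives the recovery tool far fewer places to look, so file contents are much less likely to be mistaken for metadata.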
> That's one heck of an edge case!

Yeah... especially if you tend to back up systems when you're rebuilding or replacing them by just backing up the whole disk. It's got everything, you can mount it as a loop device (which you can't do with a compressed image as someone else mentions), etc. I've got a lot of random filesystem images around, and once I trashed a fs entirely with Reiser images on Reiser, I was done with it.

I feel like you really need to have a reason to use ext3/4. I stick to xfs for daily stuff and btrfs if I want any sort of extra functionality. It's super stable in basic configs.

For example, I have a /data drive formatted xfs, then I mount a /backup formatted btrfs, rsync the 2, take a btrfs snapshot, and unmount /backup.
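
A rough sketch of that workflow in Python, for anyone who wants to script it. The layout is an assumption, not something stated above: it supposes `/backup` has an fstab entry, that `/backup/current` is a btrfs subvolume, and that `/backup/snapshots` already exists. The function names are made up; the commands themselves (`mount`, `rsync -a --delete`, `btrfs subvolume snapshot -r`, `umount`) are the standard tools.

```python
import subprocess
from datetime import date


def backup_commands(stamp, src="/data/", mountpoint="/backup"):
    """Build the command list for one backup run (paths are assumptions)."""
    current = f"{mountpoint}/current"              # assumed to be a btrfs subvolume
    snapshot = f"{mountpoint}/snapshots/{stamp}"   # assumed parent directory exists
    return [
        ["mount", mountpoint],                                      # relies on an fstab entry
        ["rsync", "-a", "--delete", src, f"{current}/"],            # mirror the data drive
        ["btrfs", "subvolume", "snapshot", "-r", current, snapshot],  # read-only snapshot
        ["umount", mountpoint],
    ]


def run_backup(dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in backup_commands(date.today().isoformat()):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stops at the first failing step


if __name__ == "__main__":
    run_backup(dry_run=True)
```

Note that with `check=True` a failed rsync leaves `/backup` mounted, which is probably what you want for debugging but is worth knowing.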

what makes xfs better than ext4 as a general purpose filesystem?
xfs dynamically allocates inode storage space. A very typical failure scenario of ext4 is to run out of inode quota while there are plenty of free blocks available. This scenario is not possible with xfs. This alone makes it more dependable than ext4 in practice.

There used to be major performance variances where xfs was better in certain workloads and ext4 in others, but those appear to have been smoothed out and now their performance seems very similar across all workloads.
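
If you want to check whether a filesystem is heading for the "out of inodes with plenty of free blocks" failure described above, a minimal Python sketch using `os.statvfs` looks like this. The function names and the 95% threshold are arbitrary choices for illustration; some filesystems report an inode total of 0, which the sketch treats as "no fixed inode limit".

```python
import os


def usage_report(st):
    """Return (block_used_fraction, inode_used_fraction); None if not reported."""
    blocks_used = 1 - st.f_bavail / st.f_blocks if st.f_blocks else None
    inodes_used = 1 - st.f_ffree / st.f_files if st.f_files else None
    return blocks_used, inodes_used


def inodes_exhausted_first(st, threshold=0.95):
    """True when inodes are nearly gone while block usage is still below threshold."""
    blocks_used, inodes_used = usage_report(st)
    if blocks_used is None or inodes_used is None:
        return False  # filesystem reports no fixed inode or block total
    return inodes_used >= threshold and blocks_used < threshold


if __name__ == "__main__":
    st = os.statvfs("/")  # Unix only
    print(usage_report(st), inodes_exhausted_first(st))
```

On the command line, `df -i` next to plain `df` shows the same two numbers.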

Lack of support for shrinking in XFS is one reason that regularly comes up for me
What use case regularly shrinks file systems? Not doubting, curious!

  > For example, I have a /data drive formatted xfs, then I mount a /backup formatted btrfs, rsync the 2, take a btrfs snapshot, and unmount /backup.
I also unmount the backup drive(s) when not in use, but I'm just cp'ing the important directories from one ext4 filesystem to another. I'd consider btrfs or xfs for general use, I'd love to know why you choose xfs for everyday use. Sure, it's better than ext4, but why not btrfs all around?
Mostly legacy. I should convert it.
I once formatted an external drive with xfs, wanted to repurpose some of the space to be readable on Windows, found out you couldn't resize xfs, reformatted the drive as exfat (lol you can't resize that either), and lost my last copy of some past files.
From the same standpoint, btrfs is not sane either. You cannot dd it to a different block device, even when unmounted. When the kernel sees two btrfs filesystems with the same UUID on two different block devices, it thinks they are one filesystem, and corrupts both.
The difference is that you can easily edit the UUID of a btrfs filesystem with `btrfstune -U` immediately after making the copy so the filesystems are unique again. You need to do this anyway if you have mount rules based on the UUID, regardless of the filesystem type, or you run the risk of mounting the wrong device. It is also possible to tell btrfs which devices make up a particular filesystem with the device= mount option, which ought to bypass the default grouping based on the UUID.
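
As a sanity check before mounting anything after a `dd` copy, you could look for btrfs devices that share a UUID. Here is a minimal Python sketch that does so from the output of `lsblk --json -o NAME,FSTYPE,UUID`. The function names are made up for illustration. One important caveat: a legitimate multi-device btrfs filesystem also reports the same UUID on every member device, so a hit means "look closer", not "this is definitely a stray copy".

```python
import json
import subprocess
from collections import defaultdict


def walk(devices):
    """Yield every device in the lsblk tree, including nested partitions."""
    for dev in devices:
        yield dev
        yield from walk(dev.get("children") or [])


def duplicate_btrfs_uuids(lsblk_tree):
    """Map each btrfs UUID seen on more than one device to those device names."""
    by_uuid = defaultdict(list)
    for dev in walk(lsblk_tree.get("blockdevices", [])):
        if dev.get("fstype") == "btrfs" and dev.get("uuid"):
            by_uuid[dev["uuid"]].append(dev["name"])
    return {uuid: sorted(names) for uuid, names in by_uuid.items() if len(names) > 1}


def scan_system():
    """Run lsblk on the local machine and report shared btrfs UUIDs."""
    out = subprocess.run(
        ["lsblk", "--json", "-o", "NAME,FSTYPE,UUID"],
        check=True, capture_output=True, text=True,
    ).stdout
    return duplicate_btrfs_uuids(json.loads(out))
```

Calling `scan_system()` returns an empty dict when every btrfs UUID appears on only one device.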
Can one wait indefinitely before running `btrfstune -U`? The "immediately" in your comment makes it sound like a racy/flaky workaround, assuming you are referring to immediacy in time.
You can wait until just before the next time the device is scanned for filesystems, or perhaps when the next btrfs filesystem is mounted. Beyond that it will start to cause problems.
Damn, I was bit by this, I had forgotten about it because it was such a long time ago.
This blew my mind. The bug is actually the main intended behaviour.
> The logical successor was supposed to be btrfs, but that project IMHO may never be ready for production use.

https://btrfs.wiki.kernel.org/index.php/Production_Users

Facebook deployed it on millions of servers. Is that production enough? Synology NAS devices also use it.

Facebook doesn’t use its servers the same way we use our computers. They image machines in and out of existence. They don’t have file systems going through power loss on a weekly basis. They don’t upgrade the kernel on existing installations. They don’t expand their storage after the fact. If their machines fail, they don’t care - they’re completely fungible.
> If their machines fail, they don’t care

This myth is being perpetuated despite btrfs devs (who work at facebook) stating the exact opposite many times over.

Every FS corruption and weird behavior is put aside and investigated. They very much do care.

https://lwn.net/ml/fedora-devel/03fbbb9a-7e74-fc49-c663-3272...

Please read the whole thread before repeating this nonsense, or at least every email sent there by Josef Bacik.

See also:

https://lwn.net/Articles/824855/

https://lwn.net/Articles/824620/

> Every FS corruption and weird behavior is put aside and investigated. They very much do care.

Just because you and I are using different meanings of the word "care" doesn't mean the point isn't valid. They "care" in that they would like to know what went wrong and study it further. They don't "care" in the sense that they suffered no real harm and no stakes were riding on any one particular server that failed. It's not just a matter of having a backup/redundancy, it's about having automated systems (or even just standard procedures that are being executed on a daily basis at that scale one way or the other) that take care of these failures. So even in production, "regular" btrfs users might have backups so "no lasting damage" would be incurred, but that's hardly the same as openly volunteering themselves for risk.

That's all besides the main point: Facebook is deploying "known good" configurations. They're using a very select subset of features. They're not trusting changed btrfs features/implementations being correct or, as was my experience, worrying about less-used/tested codepaths leading to data loss.

As a tl;dr:

“Also keep in mind we pay really close attention to burn rates for our drives, because obviously at our scale it translates to millions of dollars. Btrfs has improved our burn rates with the compression, as the write amplification goes drastically down, thus extending the life of the drives.”

As with anything it comes down to money. Yes a machine going down doesn’t impact the cluster but it does impact their wallet. Every failure of a disk costs money and on the scale of the big boys that can add up to big money.

So while “the system” doesn’t care about drive failures, the accountants and CFOs absolutely care.

Just pointing out that "caring about physical drive failure" and "caring about disk corruption or data loss" are completely independent and the latter does not directly equate big money (as there are already systems and SOP in place to deal with handling failed servers). Btrfs isn't notorious for actually frying disks, just the data on them.
Do they care about the FS just silently eating data? I ask because btrfs has been known to do that. Sure, you're not replacing the drive, but you're probably wiping the VM's disk image and creating a new one.
OP said production use. Can you define production use then?
The thing of recreating VMs a lot instead of upgrading or keeping them a long time is production use. The whole point of VMs, aside from not taking 3 months to order and provision, is that you can put the "long-term maintenance of a disk and OS" cost to zero and just recreate from SoR (hopefully git) whenever something needs to change. If you are editing state on persistent VMs, you are missing some really nice features of VM based deployment. It's like containers but more well understood and possibly more cost efficient (depending on the code).
Lots of people seem to turn up their nose at btrfs. Is there a reason for that? Was it perhaps launched before it was really ready and people still remember early versions?
> Is there a reason for that?

I can give you mine. I was working with a Raspberry Pi 3, and using a USB SSD. It's a USB2 link, so a bit choked, and I figured, hey, filesystem compression can help here, btrfs supports it, great! And it helped - you could get "real world" disk reads a good bit faster than the USB2 bus speed.

Until one day, I rebooted, and it didn't come back up. Analysis on another system was that the btrfs filesystem was just... toast. I've no idea what happened, I found some stuff that said "Oh, uh... don't use btrfs over USB, it kinda breaks in some cases...", the recovery tools couldn't even decide that the filesystem was a btrfs filesystem, and, nope.

I put data on the filesystem, I expect it to come back. btrfs broke that guarantee with a Pi full of data (nothing too important, they're just scratch systems and light desktops), so... I now stick to the boring things like ext4 that have been exceedingly well proven. Is it the best filesystem out there in terms of features? Certainly not. Am I pretty darn sure that I'm not going to trip some edge case and totally scramble the filesystem? Yes, and that's what I care about.

Lots of us got burnt with data loss and aren’t willing to give it a chance again. Maybe it’s better now? I don’t have a reason to give it a second chance when there are plenty of stable alternatives that have saved my ass in the past instead of telling me I’m SOL.
That's exactly it. I've used btrfs in production since Ubuntu 10.04, at scale since 12.04, and had nothing but great experiences with it - especially with the seed volume functionality, which allowed me to build the foundation for a major container-as-a-service platform before Docker was a thing. btrfs never lost our data, but I've also seen way too many btrfs kernel panics that were clearly related to insufficiently mature filesystem code, and I can understand people who did lose data, got burned and never want to trust btrfs again.
In its earlier days, the ENOSPC bug corrupted the filesystem.

If you run a heavy random-write workload, it fills up the disk pretty quickly and requires a re-balance _before_ you run out of space.

Of course you can set nocow on those files, but then you lose all the checksumming/snapshotting features.

For me, it was https://bugzilla.kernel.org/show_bug.cgi?id=85581. Yes this endless-write loop is long fixed, but, given that something with 99% similar symptoms has surfaced in kernel 5.16 (or was this original bug not fixed properly?), I would say no.
It is a complex beast. It needs some maintenance and performance will degrade without it.

I've never lost data to it, I've never tried the soft RAID modes it has though, but I've experienced it making a system almost unusably slow. SUSE out of the box with it automates a lot of it and it's pretty remarkable. Transactional mode if you want it seems like a game changer for some servers and the snapper stuff has saved my bacon a couple times. It's getting there but like I said, it needs some maintenance and just formatting a partition with it is likely the wrong way to experience it.

For me, when I tried btrfs (which was about 10 years ago now) I discovered it was extremely slow. And not like 50% slower—when I switched to ext4 or xfs on the same disk with the same data I was getting a 10x or so speedup.
AFAIK it's not so bad in single-device use-cases. I think most of the more recent failures I've heard about have all had to do with Btrfs RAID. The prevailing wisdom still seems to be that if you want to use RAID, use an md soft-RAID device or LVM under your single-device Btrfs filesystem.
RAID, especially 5 or 6 was my main concern, yes. If I'm using hardware RAID or a soft RAID under the FS, much of the promised benefit of btrfs is gone anyway. I can add to storage pools with ZFS or expand an LVM set, too, but what does using btrfs on top of anything buy me that ZFS, bcachefs, or something like f2fs does not?
> but what does using btrfs on top of anything buy me that ZFS, bcachefs, or something like f2fs does not?

Well, inclusion in mainline kernels is the big one over ZFS and bcachefs, I guess.

I haven't seen F2FS before, so I'm commenting on the basis of 30 seconds of Googling, here, but it doesn't look like it supports either copy-on-write or snapshots, which are the big selling points I've heard for continuing to use Btrfs on top of a device manager.

Yes, the only real problems are edge cases. I use it all the time.
All problems are edge cases, to some degree or another. The only real question is how far out those edges are, and whether users are likely to bump into them.
Edge cases like RAID 5/6, which had the write hole issue approximately a decade after btrfs was released. At some point you say "This filesystem has lost so much of my data that I will never return to it."
That's pretty old news. It's been problem free for a long time and it's very well documented where you might have issues.
Burn me once, shame on you, burn me twice, shame on me. If you purchased a new Ford and that car fell apart a week later, would you ever buy a Ford again? Some will, most won't.
I agree, I love BTRFS and have used it for ages, including some small scale production systems. But I know it still has some edge cases as you mention, which made me wonder: what is the impediment to having those cases fixed? BTRFS has been around long enough and even has some decent commercial support from a few vendors, so it seems like we can't just discount it as "it's open source and nobody is motivated to fix those long tail problems." Is there some kind of design issue that makes them hard?
edit: sorry, cheap shot at Facebook. I have no idea why BTRFS edge cases are not being fixed.

What I do know is that ZFS recently released a feature specifically for the hobbyist/frugal community. The feature allows you to grow an existing RAID array, something a financially sound business would never do. So no customer of anyone supporting ZFS would ever use this, and it took significant effort from ZFS developers to implement it. Not to mention that introducing a feature potentially introduces weird behaviour in ZFS that might endanger its (reputation of) stability.

I'm super happy with it, (as my company was not in fact financially sound when we invested in our on-premise storage hardware), but if I was CEO of ZFS I'm not sure I'd sign off on it.

Sarcastic comment adding nothing to the discussion. How rare.
I could already grow mdraid and reiserfs forever ago.
What are the equivalent edge cases in XFS, ZFS, ext4, Reiser4, Reiser5, bcachefs, or f2fs that make btrfs worth considering on a level playing field?
That is very informative about the edge cases for btrfs. My question was: what are the edge cases in the other filesystems which put them on a level playing field with btrfs, considering it still has so many?
Is ENOSPC still included as one of those edge cases?
I really do not want to use a file system that has problems at edge cases. A file system needs to be incredibly stable.
BTRFS was marked stable in 2012 yet it still has abysmal performance compared to zfs, ext4, xfs, etc.
Are they using the RAID 5 or RAID 6 code in it? Because that was declared unfit for use well after we were all advised their filesystem was ready for prime time. Then it corrupted and lost data in situations that other file systems did not.

I've heard RAID 1 and RAID 10 modes are safer, but after the FS corrupted my data I haven't really had a lot of trust in it or the people who say again that it's ready for serious use.

I'm no big company, but I've been using btrfs on my Raspberry Pi file server and its disk has been sitting there spinning for something like 8 years with no issues yet. I keep hearing that "btrfs isn't production-ready" but I wouldn't know it from experience.
Perhaps you never lost data from it years after it was billed as ready for use, but many people did. I've never lost data on Reiser3 or Reiser4. I've not used bcachefs or f2fs much yet, but I've never lost data on those either.
Hear, hear on replacing reiserfs with bcachefs. Kent Overstreet could use the support, the project looks good, and to my knowledge he is not a murderer.
The salacious details are always part of this discussion and outside of a very carefully scoped conversation like was requested on the list I think that's always going to happen. On just a technical basis, though, I think bcachefs pays a lot of attention to many of the same issues ReiserFS does, gets more current maintenance, and is maintained more along the lines of kernel interfaces other filesystems use. That last one is the strongest point made in the email to the list.

The advantage of having the creator, designer, and lead maintainer available among free society is definitely real. Regardless of the murder - even if we can separate the man's technical work from the rest of his life activities - he's not doing a lot of work on his creation while he's in custody. He's certainly not going to be furloughed to hackathons or conferences from a murder sentence.

Hans' personality got in the way of technical work in more than one way; Reiser4 never got upstreamed partly (if not mostly) because he had too big of a chip on his shoulder to make the changes requested from upstream.