Facebook doesn’t use its servers the same way we use our computers. They image machines in and out of existence. They don’t have file systems going through power loss on a weekly basis. They don’t upgrade the kernel on existing installations. They don’t expand their storage after the fact. If their machines fail, they don’t care - they’re completely fungible.
> Every FS corruption and weird behavior is put aside and investigated. They very much do care.
Just because you and I are using different meanings of the word "care" doesn't mean the point isn't valid. They "care" in that they would like to know what went wrong and study it further. They don't "care" in the sense that they suffered no real harm and no stakes were riding on any one particular server that failed. It's not just a matter of having a backup/redundancy, it's about having automated systems (or even just standard procedures that are being executed on a daily basis at that scale one way or the other) that take care of these failures. So even in production, "regular" btrfs users might have backups so "no lasting damage" would be incurred, but that's hardly the same as openly volunteering themselves for risk.
That's all beside the main point: Facebook is deploying "known good" configurations. They're using a very select subset of features. They don't have to trust that changed btrfs features/implementations are correct or, as was my experience, worry about less-used/tested codepaths leading to data loss.
“Also keep in mind we pay really close attention to burn rates for our drives,
because obviously at our scale it translates to millions of dollars. Btrfs has
improved our burn rates with the compression, as the write amplification goes
drastically down, thus extending the life of the drives.”
As with anything, it comes down to money. Yes, a machine going down doesn’t impact the cluster, but it does impact their wallet. Every disk failure costs money, and at the scale of the big boys that can add up to big money.
So while “the system” doesn’t care about drive failures, the accountants and CFOs absolutely care.
Just pointing out that "caring about physical drive failure" and "caring about disk corruption or data loss" are completely independent, and the latter does not directly equate to big money (as there are already systems and SOPs in place to deal with failed servers). Btrfs isn't notorious for actually frying disks, just the data on them.
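To put rough numbers on the burn-rate argument, here is a minimal back-of-the-envelope sketch. It assumes drive life is limited only by a rated write endurance (TBW) and that compression shrinks the bytes physically written by a fixed ratio. The function name and every number below are hypothetical, and the sketch ignores the drive's internal write amplification.

```python
def drive_lifetime_years(rated_tbw, host_writes_tb_per_day, compression_ratio=1.0):
    """Years until the rated write endurance is used up.

    compression_ratio is uncompressed size / compressed size, so 2.0 means
    only half of the logical bytes are physically written to the drive.
    """
    physical_tb_per_day = host_writes_tb_per_day / compression_ratio
    return rated_tbw / physical_tb_per_day / 365


# Hypothetical figures, for illustration only.
baseline = drive_lifetime_years(rated_tbw=600, host_writes_tb_per_day=1.0)
compressed = drive_lifetime_years(
    rated_tbw=600, host_writes_tb_per_day=1.0, compression_ratio=2.0
)
print(f"no compression: {baseline:.2f} years")
print(f"2x compression: {compressed:.2f} years")
```

Under these assumptions, halving the physical writes doubles the time before a drive hits its rated endurance, which is where the savings at fleet scale would come from.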
Do they care about the FS just silently eating data? I ask because btrfs has been known to do that. Sure, you're not replacing the drive, but you're probably wiping the VM's disk image and creating a new one.
Recreating VMs frequently instead of upgrading them or keeping them around for a long time is production use. The whole point of VMs, aside from not taking 3 months to order and provision, is that you can put the "long-term maintenance of a disk and OS" cost to zero and just recreate from SoR (hopefully git) whenever something needs to change. If you are editing state on persistent VMs, you are missing some really nice features of VM-based deployment. It's like containers but better understood and possibly more cost efficient (depending on the code).
Lots of people seem to turn up their nose at btrfs. Is there a reason for that? Was it perhaps launched before it was really ready and people still remember early versions?
I can give you mine. I was working with a Raspberry Pi 3, and using a USB SSD. It's a USB2 link, so a bit choked, and I figured, hey, filesystem compression can help here, btrfs supports it, great! And it helped - you could get "real world" disk reads a good bit faster than the USB2 bus speed.
Until one day, I rebooted, and it didn't come back up. Analysis on another system was that the btrfs filesystem was just... toast. I've no idea what happened, I found some stuff that said "Oh, uh... don't use btrfs over USB, it kinda breaks in some cases...", the recovery tools couldn't even decide that the filesystem was a btrfs filesystem, and, nope.
I put data on the filesystem, I expect it to come back. btrfs broke that guarantee with a Pi full of data (nothing too important, they're just scratch systems and light desktops), so... I now stick to the boring things like ext4 that have been exceedingly well proven. Is it the best filesystem out there in terms of features? Certainly not. Am I pretty darn sure that I'm not going to trip some edge case and totally scramble the filesystem? Yes, and that's what I care about.
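For anyone wondering how compression can beat the bus speed, as described above: only the compressed bytes cross the USB link, so the logical read rate is roughly link speed times compression ratio, capped by how fast the CPU can decompress. A minimal sketch of that arithmetic follows. The function name and all the numbers are hypothetical, not measurements from that Pi.

```python
def effective_read_mb_s(link_mb_s, compression_ratio, decompress_mb_s):
    """Approximate logical read throughput with transparent compression.

    link_mb_s:         usable link bandwidth for compressed bytes
    compression_ratio: uncompressed size / compressed size
    decompress_mb_s:   how fast the CPU can emit decompressed data
    """
    # Compressed data crosses the link; the CPU then expands it,
    # so whichever stage is slower sets the overall rate.
    return min(link_mb_s * compression_ratio, decompress_mb_s)


# Hypothetical figures: a ~35 MB/s usable USB2 link and 2x compressible data.
link_bound = effective_read_mb_s(35.0, 2.0, decompress_mb_s=200.0)
cpu_bound = effective_read_mb_s(35.0, 2.0, decompress_mb_s=50.0)
print(f"link-bound: {link_bound} MB/s, CPU-bound: {cpu_bound} MB/s")
```

With a fast enough CPU the link is the bottleneck and the gain scales with the ratio. On a slow CPU the decompressor becomes the cap instead.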
Lots of us got burnt with data loss and aren’t willing to give it a chance again. Maybe it’s better now? I don’t have a reason to give it a second chance when there are plenty of stable alternatives that have saved my ass in the past instead of telling me I’m SOL.
That's exactly it. I've used btrfs in production since Ubuntu 10.04, at scale since 12.04, and had nothing but great experiences with it - especially with the seed volume functionality, which allowed me to build the foundation for a major container-as-a-service platform before Docker was a thing. btrfs never lost our data, but I've also seen way too many btrfs kernel panics that were clearly related to insufficiently mature filesystem code, and I can understand people who did lose data, got burned and never want to trust btrfs again.
For me, it was https://bugzilla.kernel.org/show_bug.cgi?id=85581. Yes this endless-write loop is long fixed, but, given that something with 99% similar symptoms has surfaced in kernel 5.16 (or was this original bug not fixed properly?), I would say no.
It is a complex beast. It needs some maintenance and performance will degrade without it.
I've never lost data to it, I've never tried the soft RAID modes it has though, but I've experienced it making a system almost unusably slow. SUSE out of the box with it automates a lot of it and it's pretty remarkable. Transactional mode if you want it seems like a game changer for some servers and the snapper stuff has saved my bacon a couple times. It's getting there but like I said, it needs some maintenance and just formatting a partition with it is likely the wrong way to experience it.
For me, when I tried btrfs (which was about 10 years ago now) I discovered it was extremely slow. And not like 50% slower—when I switched to ext4 or xfs on the same disk with the same data I was getting a 10x or so speedup.
AFAIK it's not so bad in single-device use-cases. I think most of the more recent failures I've heard about have all had to do with Btrfs RAID. The prevailing wisdom still seems to be that if you want to use RAID, use an md soft-RAID device or LVM under your single-device Btrfs filesystem.
RAID, especially 5 or 6, was my main concern, yes. If I'm using hardware RAID or a soft RAID under the FS, much of the promised benefit of btrfs is gone anyway. I can add to storage pools with ZFS or expand an LVM set, too, but what does using btrfs on top of anything buy me that ZFS, bcachefs, or something like f2fs does not?
> but what does using btrfs on top of anything buy me that ZFS, bcachefs, or something like f2fs does not?
Well, inclusion in mainline kernels is the big one over ZFS and bcachefs, I guess.
I haven't seen F2FS before, so I'm commenting on the basis of 30 seconds of Googling, here, but it doesn't look like it supports either copy-on-write or snapshots, which are the big selling points I've heard for continuing to use Btrfs on top of a device manager.
All problems are edge cases, to some degree or another. The only real question is how far out those edges are, and whether users are likely to bump into them.
Edge cases like RAID 5/6, which still had the write hole issue approximately a decade after btrfs was released. At some point you say "This filesystem has lost so much of my data that I will never return to it."
Burn me once, shame on you; burn me twice, shame on me. If you purchased a new Ford and that car fell apart a week later, would you ever buy a Ford again? Some will, most won't.
A better analogy would be if the car got you in an accident. I don’t care if something breaks quickly as long as that means it can be returned or replaced.
I agree, I love BTRFS and have used it for ages, including some small scale production systems. But I know it still has some edge cases as you mention, which made me wonder: what is the impediment to having those cases fixed? BTRFS has been around long enough and even has some decent commercial support from a few vendors, so it seems like we can't just discount it as "it's open source and nobody is motivated to fix those long tail problems." Is there some kind of design issue that makes them hard?
edit: sorry, cheap shot at Facebook. I have no idea why BTRFS edge cases are not being fixed.
What I do know is that ZFS recently released a feature specifically for the hobbyist/frugal community. The feature allows you to grow an existing RAID array, something a financially sound business would never do. So no customer of anyone supporting ZFS would ever use this, and it took significant effort from ZFS developers to implement it. Not to mention that introducing a feature potentially introduces weird behaviour in ZFS that might endanger its (reputation for) stability.
I'm super happy with it (as my company was not in fact financially sound when we invested in our on-premise storage hardware), but if I was CEO of ZFS I'm not sure I'd sign off on it.
That is very informative about the edge cases for btrfs. My question was: what are the edge cases in the other filesystems that put them on a level playing field with btrfs, considering it still has so many?
Are they using the RAID 5 or RAID 6 code in it? Because that was declared unfit for use well after we were all advised their filesystem was ready for prime time. Then it corrupted and lost data in situations that other file systems did not.
I've heard RAID 1 and RAID 10 modes are safer, but after the FS corrupted my data I haven't really had a lot of trust in it or the people who say again that it's ready for serious use.