by Borg3 619 days ago
I wish he would write up a bit about the XFS failure he had. I've been using it for many, many years and have had no issues at all.
4 comments

I'm interested too. I'm using XFS only, and have for many years. On my own boxes, but my company also uses XFS for all the data on customer computers. We did extensive testing many years back, and XFS was the only filesystem at the time which gave a linear, constantly very high performance when writing and reading huge amounts of data (real-time data, dips in performance is a 100% no-no), and also not degrading when having huge numbers of files. We've never had a customer lose data due to XFS problems, and at this point I can't imagine how much data that would be, except that it's astronomical.

That said, we had routine XFS losses on SGI boxes. That was a very well-known scenario: write constantly to a one-page text file, say, every few seconds, then power cycle the machine. The file would be empty afterwards. This doesn't happen on Linux. I vaguely recall discussing this with someone some years ago (maybe here on HN), and something was changed at some point, maybe when SGI migrated XFS to Linux, or shortly after.
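For the "rewrite a small file every few seconds, then lose power" scenario, a common application-side defence on any filesystem is to write a temporary file, fsync it, and rename it over the old one. The sketch below is only an illustration of that pattern; the helper name `durable_write` is made up here, and it assumes a POSIX system where a directory can be opened and fsynced. It does not claim to explain what IRIX XFS was doing internally.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, hence the same filesystem,
    # which is what makes the later rename atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the kernel to push file data to the device
        os.replace(tmp_path, path)  # atomic rename over the previous version
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a power cut (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that this only helps if the drive honours flush requests; a disk that acknowledges writes it has not actually persisted can still lose them.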

It's hard to know the timeline of his data loss, but I am assuming it was a long time ago.

XFS is originally from SGI Irix and was designed to run on higher end hardware. SGI donated it to Linux in 1999 and it carried a lot of its assumptions over.

For example on SGI boxes you had "hardware raid" with cache, which essentially is a sort of embedded computer with its own memory. That cache had a battery backup so that if the machine had a crash or sudden power loss the hardware raid would live on long enough to finish its writes. SGI had tight control over the type of hardware you could use and it was usually good quality stuff.

In the land of commodity PC-based servers this isn't often how it worked. Instead you just had regular IDE or SATA hard drives. And those drives lied.

On cheap hardware the firmware would report back it had finished writes when in fact it didn't because it made it seem faster in benchmarks. And consumers/enterprise types looking to save money with Linux mostly bought whatever is the cheapest and fastest looking on benchmarks.

So if there was a hardware failure or sudden power loss, there could be several megs of writes still in flight when the file system thought they were safely written to disk.

That meant there was a distinct chance of data loss when using Linux and XFS early on.

I experienced problems like that in early 2000s era Linux XFS.

This was always a big benefit to sticking with Ext4. It is kinda dumb luck that Ext4 is as fast as it is when it came to hosting databases, but the real reason to use it is because it had a lot of robust recovery tools. It was designed from the ground up with the assumption that you were using the cheapest crappiest hardware you can buy (Personal PCs).

However modern XFS is a completely different beast. It has been rewritten extensively and improved massively over what was originally ported over from SGI.

It is different enough that a guy's experience with it from 2005 or 2010 isn't really meaningful.

I have zero real technical knowledge on file systems except as an end user, but from what I understand FreeBSD uses UFS that uses a "WAL" or "write ahead log", where it records the writes it is going to do before it does them. I think this is a simpler but more robust solution than the sort of journalling that XFS or Ext4 uses. The trade-off is lower performance.
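To make "record the write before doing it" concrete, here is a toy write-ahead log for a key-value store. It is a sketch of the general idea only, not how UFS, XFS, or Ext4 actually work; the class name `TinyWAL` and the one-JSON-record-per-line format are invented for illustration.

```python
import json
import os


class TinyWAL:
    """Toy key-value store that logs every write before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._replay()
        self._log = open(log_path, "ab")

    def _replay(self):
        """Rebuild state from the log and drop a torn final record, if any."""
        if not os.path.exists(self.log_path):
            return
        good_end = 0
        with open(self.log_path, "rb") as log:
            for raw in log:
                if not raw.endswith(b"\n"):
                    break  # record was cut off mid-write by a crash
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # unreadable record: stop replaying here
                self.state[record["key"]] = record["value"]
                good_end += len(raw)
        # Discard anything after the last complete record.
        os.truncate(self.log_path, good_end)

    def put(self, key, value):
        record = json.dumps({"key": key, "value": value}).encode("utf-8")
        self._log.write(record + b"\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # the log entry is durable first...
        self.state[key] = value       # ...and only then is the write applied

    def close(self):
        self._log.close()
```

The fsync on every `put` is where the performance cost mentioned above comes from: each write waits for the device before it is considered done.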

As far as ZFS vs Btrfs... I really like to avoid Btrfs as much as possible. A number of distros use it by default (OpenSUSE, Fedora, etc), but I just format everything as a single partition as Ext4 or XFS on personal stuff. I use it on my personal file server, but it's a really simple setup with a UPS. I don't use ZFS, but I strongly suspect that btrfs simply failed to rise to its level.

One of the reasons Linux persists despite not having something up to the level of ZFS is that most of ZFS features are redundant to larger enterprise customers.

They typically use expensive SAN or more advanced NAS that has proprietary storage solutions that provide ZFS-like features long before ZFS was a thing. So throwing something as complicated as ZFS on top of that really provides no benefit.

Or they use one of Linux clustered file system solutions, of which there is a wide selection.

Facebook runs their entire stack using Btrfs [0]. I would encourage anyone who is stuck in the "oh btrfs is so buggy and loses data" mindset (not helped by articles like this [1] that play off btrfs as some half-baked contraption, when it's really btrfs raid that needs a LOT more time to bake) to look into things and realize that large companies (OpenSuse, Redhat, Facebook) have poured a lot of time into getting it to work well.

I don't know about its multi-disk story (I do use ZFS for that personally), but for single disk options it is great. You get so many of the ZFS benefits (snapshots, rollback, easily create and delete volumes, etc) with MUCH lower memory usage (at least in my own experiments to try this out).

[0] https://lwn.net/Articles/824855/ [1] https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...

Facebook has stacks of thousands of spare nodes ready at any moment to replace a failed node. All essential data will be replicated across many different boxes so if a box fails you just replace it with a fresh node and replicate the data there.

This is much different to the consumer usecase where computers are pets and not cattle. A failed filesystem the night before you need to turn in your thesis may have a much larger impact on your life.

Another thing to consider is that Facebook runs btrfs on enterprise hardware (including SSDs with battery backups) which is going to be much more reliable than some chromebook which lives in the bottom of your backpack that you bring on transit every day.

Finally, I will say that the copy on write features of btrfs can result in some wildly different behaviour based upon how you use it. You can get into some very bad pathological cases with write amplification, and if you run btrfs on top of LUKS it can be nearly impossible to figure out why your disk is being pegged despite very little throughput at the VFS layer.
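One rough way to put a number on write amplification is to compare what the application wrote with how much the block device's "sectors written" counter moved over the same interval. The sketch below parses two snapshots of `/proc/diskstats` text; the function names and the device name in the usage note are placeholders, and how you measure the logical bytes is left to you.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts in 512-byte units, whatever the drive's sector size


def sectors_written(diskstats_text: str, device: str) -> int:
    """Return the cumulative 'sectors written' counter for `device`."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then the I/O counters; sectors written is the 10th field.
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise KeyError(device)


def write_amplification(before: str, after: str, device: str, logical_bytes: int) -> float:
    """Ratio of bytes that reached the device to bytes the application wrote."""
    delta = sectors_written(after, device) - sectors_written(before, device)
    return (delta * SECTOR_BYTES) / logical_bytes
```

In practice you would read `/proc/diskstats` before and after a workload and pass both strings in, for example `write_amplification(before, after, "nvme0n1", bytes_your_app_wrote)`. With LUKS in the stack there is both a dm device and the underlying disk, so it is worth checking each.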

The ChromeOS Linux dev VM uses btrfs by the way.
So much FUD in this discussion. Chris Mason has said publicly that they use the cheapest SSDs they can find (even worse things than what he would be willing to put in his laptop), and that they investigate every instance of btrfs corruption. You're saying the exact opposite of the main btrfs guy at Facebook. I wonder who is right...
Who is right, one guy whose reputation relies on something not breaking or a bunch of end users who report the thing broke for them?

I experienced issues with write amplification within the past few months in Ubuntu 22 so it isn’t like all the issues are gone. I do agree that there are fewer issues now than there were before, but I will still say that btrfs breaks or behaves unexpectedly much more often than ext4 or xfs.

Meta does a lot of things that don't scale for reliable/trustworthy systems and aren't suitable for all use-cases. (I also used to work there too.)

ZFS is only reliable where it was battle-tested: on Solaris. ZoL has been tinkered with and smashed up so much that it's nothing like running a Thumper as a NAS.

XFS + mdadm on Linux is, without a doubt, far more reliable than ZoL. Ask me how I know. I have the scars to prove it.

ZFS on Linux is absolutely fine in high-performance and critical computing applications.

I also owned a Thumper and Thor running Solaris in 2009. Much prefer Linux and the hardware solutions today.

ZFS has been plenty battle tested on FreeBSD.
Not hardly, and not in the way you think. They replaced their arguably purer ZFS port with ZoL. As such, it's nowhere near as tested and proven as existing solutions like ext4 and xfs, which Redhat has deployed to millions of machines for decades. ZFS has too many religious fanboys who hype it without considering that boring and reliable are less risky than betting on code that hasn't had nearly the same scale of enterprise experience.
I am well aware of that change.

What specific problems are there with the ZFS implementation on FreeBSD? You claim it is not battle tested, I find it to be rock solid..

Did you or other XFS users try out stratis?
tell me your story
Yeah, my setup too, XFS + mdadm (+ eventually LVM2). Rock solid. It might not have HW raid performance, but in terms of stability, flexibility and recovery it's absolutely unbeatable!
I am stuck in the btrfs-is-buggy mindset precisely because it managed to lose my root partition on a single disk machine. It might also have raid problems, but not exclusively.
Me too. Repeatedly, at least once a year, on 3 different machines.

The cause? Filling up the filesystem. Why? Because of OS snapshots.

(Aside: why can they fill it? Because it doesn't give a straight answer to `df -h`. Why not? Because of snapshots.)
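For context on the `df` complaint: `df` essentially reports what the filesystem returns from `statvfs`, so it can only be as accurate as those numbers. A minimal sketch of that view is below (the helper name `df_view` is made up). On btrfs the figures are an estimate, since snapshots share extents and data and metadata are allocated separately; `btrfs filesystem usage <path>` gives the filesystem's own breakdown.

```python
import os


def df_view(path: str) -> dict:
    """Return roughly the byte counts that `df` derives for the filesystem holding `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bfree * st.f_frsize        # free blocks, including root-reserved ones
    available = st.f_bavail * st.f_frsize  # what an unprivileged user may still allocate
    return {"total": total, "used": total - free, "available": available}
```

Comparing `df_view("/")` against the output of `btrfs filesystem usage /` on a snapshot-heavy machine is one way to see how far apart the two views can drift.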

That happened recently? A few years ago they added a reserved area used for emergency purposes that should solve situations like that. Can't say I've run into these problems, although I don't tend to run btrfs very heavily because performance becomes unacceptable long before that due to CoW.

https://btrfs.readthedocs.io/en/latest/btrfs-filesystem.html

(Look for "GlobalReserve")

> That happened recently?

It happened to me repeatedly on both openSUSE Leap and openSUSE Tumbleweed during the 4 years I worked for SUSE: 2017-2021.

The `df` command doesn't work: it does not give reliable info. That alone disqualifies this FS for me.

The `fsck` equivalent does not work: every time I have tried, it corrupts volumes into unreadability.

Those 2 things are hard requirements for me.

I raised this internally as significant issues. They were dismissed.

> I don't know about its multi-disk story (I do use ZFS for that personally), but for single disk options it is great.

I can reliably, across vendors and drives, break RAID10 on BTRFS where MD+LVM are totally fine. Simply pull power. Discovered this when building out my latest workstation.

I haven't tried other configurations; after finding this pattern I decided to leave BTRFS for single-disk configurations where I want CoW

I use btrfs a lot but I'm not sure if I'd use it for production servers. The I/O bandwidth is just a lot lower and I get weird latency problems on desktop Linux when BTRFS is very busy that I don't get on other file systems. Then again, I probably wouldn't use ZFS for anything but a NAS setup either.
The default is to coalesce trim requests into large batches and issue them once per minute or so. Most other filesystems don't use online trim. This can cause latency spikes. If you ever decide to try it out again, try disabling online trim.

https://btrfs.readthedocs.io/en/latest/Trim.html

> Facebook runs their entire stack using Btrfs

Yeah, and when I was there machines would run out of disk space at 50% usage and it took months to figure out why. In the meantime, they'd just reimage the machine and hope. I don't recall any issues with data loss, but it didn't have the air of reliability.

But my team was weird at FB, our uptimes of 45 days were way above the average, and we ran into all sorts of things because we operated outside the norm.

Is the structure of 800gb btrfs containers mentioned in [0] how user data is stored? Just sharded across billions of containers?
Last time they talked about it (that I know of -- when Fedora was contemplating using btrfs and asked Chris Mason et al for their opinion), FB were running databases on xfs and were looking for ways to place them on raw disks for maximum performance. So not the entire stack.
Why is this grey/down? Is there something factually incorrect?

Edit: it's less grey now.

> For example on SGI boxes you had "hardware raid" with cache, which essentially is a sort of embedded computer with its own memory. That cache had a battery backup so that if the machine had a crash or sudden power loss the hardware raid would live on long enough to finish its writes. SGI had tight control over the type of hardware you could use and it was usually good quality stuff.

Most of the SGI machines I've used, of various sizes, did not have hardware raid. In my experience, you were more likely to run into hardware raid on a PC than on traditional SGI or Sun servers (I don't have much experience with AIX or HP-UX), unless the unix server was in a SAN environment.

Yep. The Octanes and the Challenge servers we used at work didn't have hardware raid, and, contrary to grandparent we did have regular issues on SGI with XFS (loss of data after power cycling, always), while we never had that on Linux, which surprised me. After all, it was so easily reproduced (on SGI): Write regularly to a file, power cycle, file empty afterwards. Did that on Linux, everything fine. Every time. Never ever had losses on Linux. NB: I did not test this immediately after XFS was ported to Linux, it's very likely that things were improved on shortly after, before we started testing XFS at work.

(As for hardware raid - I didn't start to see hardware raid regularly until HP started shipping rackmount servers with Compaq raid hardware, way back. Linux boxes..)

Sounds fair enough. I never used SGI personally. I was just repeating what I read while dealing with XFS issues back in the day.

Bad old days.

"One of the reasons Linux persists despite not having something up to the level of ZFS is that most of ZFS features are redundant to larger enterprise customers."

ZFS is used heavily on Linux and runs well, though there are some limitations which are being addressed over time in the OpenZFS project. It is used across all areas that Linux serves, whether laptop, desktop, home server all the way to enterprise. https://openzfs.org/wiki/Main_Page

ZFS on Linux is not really usable for most users because every kernel update can break your ZFS compatibility.

Meaning unless you want to put in the time to manually test every kernel update and ensure your kernel version stays in-sync with OpenZFS you can very likely end up with an unbootable system.

Ubuntu supports ZFS, so if you can track Ubuntu's kernel, you get ZFS without risking unbootable system.
This is one of the main reasons Void Linux is "stuck" on kernel 6.6.
You mean apart from 6.6 being the current latest longterm kernel?

https://kernel.org/

The 'linux' package on Void is just a meta package. Install whatever kernel series you want. I'm running 6.10.11, with ZFS 2.2.6 on my Void workstation.
> I understand FreeBSD uses UFS that uses a "WAL" or "write ahead log".. where it records writes it is going to do before it does it.

I think you're describing UFS soft-updates? I think that's more or less for meta data updates, not data data. It's been a while since I reviewed it, but it gets you nice things like snapshots and background fsck so after an unclean restart your system can get back to work immediately and clean up behind the scenes. There is some sort of journalling that's fairly new, but my experience from 10 years ago was soft-updates and background fsck just worked; and if you wanted better, ZFS was probably what you want, if you can afford copy on write.

I had one a few years back where we ran out of inodes on a Jenkins machine on CentOS 7 and it crashed and couldn’t remount the filesystem. I had to restore a backup which was time consuming on a 4TB volume with crazy amounts of files.
Used it since the late 90s on IRIX, think there were a few issues early on with the endian swap, but no issues for the best part of twenty years for me!