Hacker News
by krondor 3694 days ago
btrfs is stable as of kernel 3.14 or so. The features in btrfs that are unstable at this point are raid56 (risk of the write hole on power failure), autodefrag (and mainly just with high transaction files like VM images and databases), and in-place filesystem conversion from ext3/4.

We've been running btrfs in production since Ubuntu 14.04 with excellent results. The feature set vastly outweighs the few risks that remain.

5 comments

You forgot about how sensitive it is to low disk space. Don't go above 80% in production is what I've been told - that's a lot of wasted disk space.
You are correct that btrfs and ZFS are sensitive to low disk space. This is related to how CoW filesystems function: they need that free space to commit writes and for snapshots because, by nature, they don't overwrite blocks in place.

See this for the ZFS example; http://serverfault.com/a/556892/79238

In both cases the exact amount of free space you desire is a mix of workload, fragmentation, and snapshots.
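
As a rough illustration of the "don't go above 80%" rule of thumb, here is a minimal Python sketch that checks how full a mount point is. The function names and the 0.80 default are assumptions made for this example; the right threshold depends on workload, fragmentation, and snapshots, and on btrfs the numbers reported through the generic interface behind `shutil.disk_usage` are themselves only estimates.

```python
import shutil


def fill_fraction(total_bytes: int, free_bytes: int) -> float:
    """Return the fraction of the filesystem in use, from 0.0 to 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - free_bytes) / total_bytes


def over_threshold(path: str, threshold: float = 0.80) -> bool:
    """Return True if the filesystem holding `path` is fuller than `threshold`.

    The 0.80 default is only the rule of thumb quoted in the thread,
    not a limit defined by btrfs or ZFS.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return fill_fraction(usage.total, usage.free) > threshold
```

Calling `over_threshold("/")` from a cron job or monitoring check would be one way to get a warning before a CoW filesystem starts running short of room for new writes.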

However, I disagree they waste a lot of space. I think both ZFS and btrfs more than make up for the overhead of free space commits through their space saving features. cp --reflink, block suballocation, compression, and efficient snapshots outweigh the overhead, in my case. Your mileage may vary.

So, isn't there a way to set aside a buffer space so that you don't run into ENOSPC problems? Maybe like ext's 5% reserve.
As I understand it, in btrfs's case there are two problems: 1) Metadata in btrfs can use lots of space, especially when you convert from ext4. You might end up with gigabytes of free space reserved for metadata that can no longer be used for data. This can be solved with rebalancing, but that can take ages, which is actually one of the reasons ZFS doesn't have the bp rewrite feature. 2) btrfs can have mixed raid levels, and in that scenario calculating free space is tricky, but people still rely on common tools that simply give estimates in that case. Change the way it estimates free space and you'll have fewer clueless people complaining about the fs running out of it, but more will say btrfs shows too little.
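
To see why a single "free space" number is tricky with mixed raid levels, here is a small Python sketch. It assumes a simplified model in which each profile costs a fixed number of raw bytes per byte of data; the table and function name are illustrative, not how btrfs itself computes its estimate.

```python
# Raw bytes consumed on disk per byte of user data (simplified model).
RAW_PER_DATA_BYTE = {
    "single": 1.0,
    "raid0": 1.0,
    "dup": 2.0,    # two copies on the same device
    "raid1": 2.0,  # two copies on different devices
    "raid10": 2.0,
}


def estimate_usable(raw_unallocated: int, profile: str) -> int:
    """Estimate how much data fits in `raw_unallocated` bytes under `profile`."""
    return int(raw_unallocated / RAW_PER_DATA_BYTE[profile])
```

With 100 GiB of raw unallocated space, the sketch reports 100 GiB for `single` but 50 GiB for `raid1`. A generic tool has to pick one answer before it knows which profile the next write will use, which is why either choice will look wrong to somebody.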
FWIW, Hammer on DragonflyBSD can rebalance and dedup with little memory and doesn't take long, but details matter and the comparison may not be fair. What's rebalance in ZFS might be something much more trivial and less effective in Hammer, but I've deduped Hammer filesystems on machines with little memory compared to what ZFS requires for its data structures in memory.
ZFS' data deduplication requires very little memory. However, it will check every new record write under it against every other record write. The only way to do this in a performant way is to lean on cache. Without sufficient cache, you degrade to performing 3 random sequential IOs, which performs terribly. The system will continue to run, but it will be slow.

As far as I know, there is no way to implement online deduplication with constant RAM usage without performing poorly as things scale or playing Schrödinger's cat with whether data that should deduplicate is subject to deduplication. Offline data deduplication might work, but it would be performance crippling ZFS' data integrity guarantees.

If HAMMER has online data deduplication that is performant with constant ram, they likely made a sacrifice elsewhere to get it. My guess is that it misses cases, such that while you would expect unique records to be written once, they can be written multiple times.
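
The trade-off described above can be sketched with a toy online deduplicating store in Python. This is an illustration only, not how ZFS or HAMMER implement deduplication; the class name and the choice of SHA-256 are assumptions for the example.

```python
import hashlib


class DedupStore:
    """Toy online dedup: every write is checked against all earlier writes."""

    def __init__(self) -> None:
        self.table = {}   # checksum -> block id; grows with unique records
        self.blocks = []  # the unique records actually stored

    def write(self, record: bytes) -> int:
        """Store `record` if it is new and return the id of the stored block."""
        digest = hashlib.sha256(record).digest()
        block_id = self.table.get(digest)
        if block_id is None:
            # First time this content is seen: store it and remember its checksum.
            block_id = len(self.blocks)
            self.blocks.append(record)
            self.table[digest] = block_id
        return block_id
```

The lookup is only fast while `table` fits in memory, and `table` gains one entry per unique record. Capping its size keeps RAM constant but means some duplicates are no longer recognised, which is the "misses cases" scenario guessed at above.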

I believe you're misunderstanding the problems that occur on a cow filesystem. In fact btrfs already has an overcommit disk buffer, and is already doing many ENOSPC handling tricks.

Have a look; https://btrfs.wiki.kernel.org/index.php/ENOSPC

Reading more for my own interest, it seems ZFS uses ZIL to help convert random writes to sequential, which helps with fragmentation under low space. I am curious if bcache can operate similarly. In addition, I should also point out that this is less of an issue on SSD, due to the nature of how random reads/writes work there anyway (btrfs does a good job of being SSD aware).

The last I checked, ZFS had no open ENOSPC bugs. The trick that it uses is to reserve a small amount of space. I forget if it is 1.6% or 3.3%, but whichever that is, it combined with other tricks is considered to be enough.

ENOSPC is a very different condition from lower performance. If you tested filesystems at 90% full, you should find that all of them have lower performance than when they were empty. You might also find performance varies based on how you filled them.

You're right that the free space overhead is workload dependent. However, compression is orthogonal to the FS and for us, in the Hadoop world, we win nothing with efficient snapshots or other features. The problem we have is estimating how much overhead is 'safe', so we are inherently conservative. The lost disk capacity is a big deal on 1k+ hadoop clusters.
I'm going to disagree with you re: compression. Compression at the FS layer brings a lot of benefits, and generally most workloads are IO constrained not CPU constrained. Compression at the FS improves IO at the expense of CPU.
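
A back-of-the-envelope model of that IO-versus-CPU trade-off, as a Python sketch. The function and parameter names are assumptions for illustration, and the model ignores caching, seeks, and parallelism.

```python
def effective_read_mb_s(
    disk_mb_s: float,
    compression_ratio: float,
    decompress_mb_s: float,
) -> float:
    """Estimate the uncompressed read rate seen by an application.

    compression_ratio is uncompressed size divided by compressed size.
    decompress_mb_s is how fast the CPU can produce uncompressed output.
    The slower of the two stages limits the result.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return min(disk_mb_s * compression_ratio, decompress_mb_s)
```

Under this model a 100 MB/s disk with 2:1 compression delivers 200 MB/s as long as the CPU can decompress at least that fast. At a ratio of 1.0 (data that is already compressed) the read rate gains nothing and the CPU time is pure cost.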

Hadoop is a completely different workload, and maybe not something for ZFS or btrfs. Our Hadoop nodes are not raid, just JBOD ext4 disks. We have been considering btrfs with nodatacow mount option and lz4 compression, however. We haven't decided if it's better to compress within Hadoop or at the fs layer yet. I would be curious about your findings.

In Hadoop, people are mostly using formats like Parquet or ORC, and if not, compression libs like lzo or snappy. If you believe the Berkeley people (I don't, but the sheeple do), most Spark workloads are CPU bound, not IO bound. But irrespective of that, if most of your data is in a columnar data storage format, there's no gain (only cost) in having your FS also try to compress it. JBOD is considered best practice for Hadoop. That's why we're looking at RAID0 and RAID5 - we're researchers :) Actually, MapR recommends using 3 disks in RAID0 as volumes.
btrfs does not support lz4 compression. Unless you meant lzo, which performs terribly on incompressible data, you will want to use ZFS for lz4.

Also, nodatacow is a hack. If you take a snapshot on btrfs with nodatacow, it must use CoW on each thing in the snapshot that is overwritten. Until then, whatever horrible performance nodatacow prevented will manifest temporarily. ZFS is designed to make things asynchronous as much as possible (with the exception of partial record writes to uncached records, which needs to be fixed), so it lacks an equivalent to nodatacow and does not need it.

Yes, apologies, I did mean lzo. zlib seemed perhaps too much on the CPU side of the IO/CPU calculation. I am excited to see how btrfs snappy and lz4 support compares when they are added, however.

To the point of lzo performance though, btrfs is a tad smart with compression, it tries to compress an initial 128KiB and if the compressed segment is not smaller than the uncompressed it adds it to a list of no compress files, and will not try compression on that file again (unless of course you force it).
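
The heuristic described above can be sketched in Python as follows. This is an illustration of the idea as stated in the comment, not btrfs code: zlib stands in for the filesystem's compressor, and the class and method names are made up for the example.

```python
import zlib

SAMPLE_SIZE = 128 * 1024  # the initial 128 KiB mentioned above


def worth_compressing(data: bytes) -> bool:
    """Return True if the leading sample shrinks when compressed."""
    sample = data[:SAMPLE_SIZE]
    return len(zlib.compress(sample)) < len(sample)


class NoCompressTracker:
    """Remember files whose first sample did not compress."""

    def __init__(self) -> None:
        self.no_compress = set()

    def should_compress(self, name: str, data: bytes, force: bool = False) -> bool:
        if force:
            return True  # forcing skips the heuristic entirely
        if name in self.no_compress:
            return False  # flagged earlier, so do not try again
        if not worth_compressing(data):
            self.no_compress.add(name)
            return False
        return True
```

The point of the flag is that already-compressed data (media, archives, columnar formats) pays the cost of one failed attempt rather than a failed attempt on every write.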

This was for our Hadoop use case, comparing to ext4, so nodatacow would work because we have no desire for snapshots in that environment. It still seems like we're better off compressing within the Hadoop framework (as jamesblonde is doing) and sticking with ext4 jbod, for now at least.

I haven't heard this and I've been on the btrfs list for years. The biggest issue you're going to have at 80+% usage, that applies to all file systems getting slow, is transfer rate drops due to inner tracks having fewer sectors; and depending on the workload, seeks are increasing simply because there's more stuff on the disk to go looking for.

As for fragmentation there are two kinds: fragmentation of files into more than one extent, fixed with 'btrfs filesystem defrag' and also autodefrag. And the other is fragmentation of free space, as a result of deleting files and that's fixed with 'btrfs balance' which consolidates extents and writes them into new chunks then frees up large regions of contiguous space on the drives. This is best used with filters.
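
For reference, the two maintenance commands mentioned above could be assembled like this in Python. The helper names and the default of 50 are assumptions for illustration; the sketch only builds argument lists and runs nothing.

```python
from typing import List


def defrag_command(path: str, recursive: bool = True) -> List[str]:
    """Build a 'btrfs filesystem defragment' call for file fragmentation."""
    cmd = ["btrfs", "filesystem", "defragment"]
    if recursive:
        cmd.append("-r")  # descend into directories
    cmd.append(path)
    return cmd


def balance_command(mountpoint: str, data_usage_pct: int = 50) -> List[str]:
    """Build a filtered 'btrfs balance start' call for free-space fragmentation.

    The -dusage filter restricts the balance to data chunks that are at
    most `data_usage_pct` percent full, so mostly-full chunks stay put.
    """
    if not 0 <= data_usage_pct <= 100:
        raise ValueError("data_usage_pct must be between 0 and 100")
    return ["btrfs", "balance", "start", f"-dusage={data_usage_pct}", mountpoint]
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` by whoever runs the maintenance job. A usage filter is what keeps a balance from rewriting the whole filesystem.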

The problem is "in production" you can do a lot of things which ironically don't work for smaller users, because "in production" you specifically optimize and assume hardware and applications will crash in catastrophic ways pretty much all the time and you code quite differently (i.e. it is totally reasonable to ask applications to be distributed and deal with it).
Agreed, my workload is probably not representative of a small organization or user. I absolutely leverage resiliency in software, distributed computing, and mitigate single points of failure, etc... This is a good point.

Now, I do think these methods are increasingly approachable for all users, however. A lot of that is actually enabled by the feature sets of ZFS and btrfs. The default Ubuntu installer, for example, will create snapshots during OS upgrade for rollback if the upgrade fails. ZFS and btrfs send/receive feature allows for efficient DR clones (not to mention seed images and snapshots). LXD leverages ZFS if you choose to for rapid containerization and snapshots.

These intrinsic abilities of these two filesystems allow for smaller users and organizations to improve those workflows to be more risk averse in general (even while assuming some risk in a newer FS).

Just, if possible, be up to date on the known issues and run more recent kernels and userland utilities.

https://btrfs.wiki.kernel.org/index.php/Gotchas

"stable" != "mature"
That's a fair point. I guess we can debate what 'mature' means.

In development since 2007, stable since 2014. ZFS, in development since 2001 (correction I erroneously listed 2005 earlier), stable since 2006 (or at least Solaris included since then). Do you consider the Solaris years or just the Linux years and then do you consider the Linux Debian/Ubuntu sanctioned years or the ZoL years?

I'm fine with mature since included in the default installer on Ubuntu/Redhat/Oracle/SUSE/etc... for my definition of btrfs maturity.

I think you've confused your dates. ZFS has been in development since 2001. It was first introduced in 2005.
Yes, thanks for catching that, I've edited the parent.

As you can tell I'm not that strong on ZFS other than what I've learned in some small tinkering and discussion with others.

It seems odd that after all this time btrfs wouldn't be able to handle database workloads. If it can't handle high I/O it doesn't have much use except for a boot disk. Do you have any reading you could suggest on this issue?
As zanny said, cow presents an issue for transactional workloads (databases, vm images, etc...) as each write fragments the file, by nature of the process of cow.
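
A toy simulation in Python shows why random in-place updates fragment a file under CoW. It assumes a simplified allocator that always places rewritten blocks at the next free address past the end; real allocators are more sophisticated, so treat this as an illustration of the mechanism only.

```python
import random
from typing import List


def count_extents(physical: List[int]) -> int:
    """Count runs of logically adjacent blocks that are also physically adjacent."""
    if not physical:
        return 0
    extents = 1
    for prev, cur in zip(physical, physical[1:]):
        if cur != prev + 1:
            extents += 1  # a physical gap starts a new extent
    return extents


def simulate_cow_overwrites(num_blocks: int, num_writes: int, seed: int = 0) -> int:
    """Overwrite random blocks of a contiguous file and return the extent count."""
    rng = random.Random(seed)
    physical = list(range(num_blocks))  # the file starts as one contiguous extent
    next_free = num_blocks              # new blocks are allocated past the end
    for _ in range(num_writes):
        logical = rng.randrange(num_blocks)
        physical[logical] = next_free   # CoW: new data goes to a new location
        next_free += 1
    return count_extents(physical)
```

A file that is written once stays at one extent, while a database-style pattern of small random updates splits it further with almost every write. With CoW disabled the block would be overwritten where it is and the extent count would not change.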

btrfs can handle database workloads but you have to disable cow for them (which you can do at the file, directory, or subvolume level in btrfs). You would specify the nodatacow mount option, or chattr +C (file/directory).
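
A minimal Python sketch of the `chattr +C` route. The helper names and the dry-run default are assumptions for this example. Note that the attribute is meant to be set on new or empty files, or on a directory so that files created in it afterwards inherit it.

```python
import subprocess
from typing import List


def nocow_command(path: str) -> List[str]:
    """Build the chattr call that sets the No_COW attribute on `path`."""
    return ["chattr", "+C", path]


def mark_nocow(path: str, dry_run: bool = True) -> List[str]:
    """Return the command and, unless dry_run is set, execute it.

    Intended for an empty database directory before the data files are
    created; setting +C on a file that already has data is not reliable.
    """
    cmd = nocow_command(path)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd
```

For example, `mark_nocow("/var/lib/mydb", dry_run=False)` on a freshly created, empty directory (the path is hypothetical) would make the database files created there skip CoW.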

The btrfs autodefrag is still rather new and needs some work. I expect it could be the long-term fix (manual defrag is fine now, but you wouldn't want to call it frequently on a db file). I'm not sure how ZFS handles the fragmentation, but I do know ZFS observed similar issues in the past (they seem to have been mostly resolved). I should also point out that disabling cow doesn't fully disable it: snapshots can still function, etc. However, I'm sure that once you start to use the other cow functions you might observe slipping performance due to fragmentation of these types of files.

https://blog.pgaddict.com/posts/friends-dont-let-friends-use... https://bartsjerps.wordpress.com/2013/02/26/zfs-ora-database...

btrfs is only particularly slow on database files if they are not marked for inplace writing because the filesystem is default copy on write, which is horribly slow for huge constantly changing files (they end up highly fragmented across data blocks if you don't turn off cow).
btrfs is not stable!, xfs is stable