by jamesblonde 3694 days ago
You forgot about how sensitive it is to low disk space. Don't go above 80% in production is what I've been told - that's a lot of wasted disk space.

You are correct that btrfs and ZFS are sensitive to low disk space. This is also related to how CoW filesystems function: they need that free space to commit writes and for snapshots because they, by nature, don't overwrite blocks.

See this for the ZFS example; http://serverfault.com/a/556892/79238

In both cases the exact amount of free space you desire is a mix of workload, fragmentation, and snapshots.

However, I disagree they waste a lot of space. I think both ZFS and btrfs more than make up for the overhead of free space commits through their space saving features. cp --reflink, block suballocation, compression, and efficient snapshots outweigh the overhead, in my case. Your mileage may vary.
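
As a back-of-the-envelope sketch of that trade-off: the 80% fill limit is the figure from the parent comment, while the 4 TB disk, the 95% fill level, and the 1.5:1 compression ratio are made-up numbers for illustration, not measurements.

```python
TB = 10**12

def effective_capacity(raw_bytes, fill_limit, compression_ratio):
    """Logical bytes that fit if you stop filling at fill_limit and the data
    shrinks by compression_ratio (logical size / on-disk size)."""
    return raw_bytes * fill_limit * compression_ratio

# Hypothetical 4 TB disk kept below 80% full, with data compressing 1.5:1.
cow = effective_capacity(4 * TB, 0.80, 1.5)
# The same disk filled to 95% with no filesystem-level space savings.
plain = effective_capacity(4 * TB, 0.95, 1.0)

print(f"CoW fs:   {cow / TB:.2f} TB of logical data")
print(f"plain fs: {plain / TB:.2f} TB of logical data")
```

Under these assumptions the space savings have to be better than roughly 0.95 / 0.80 ≈ 1.19:1 before the CoW filesystem comes out ahead, which is why the answer depends so much on the workload.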

So, isn't there a way to set aside a buffer space so that you don't run into ENOSPC problems? Maybe like ext's 5% reserve.
As I understand it, in btrfs's case there are two problems: 1) Metadata in btrfs can use lots of space, especially when you convert from ext4. You might end up with gigabytes of free space reserved for metadata, so it cannot be used for data anymore. This can be solved with rebalancing, but that can take ages, which is actually one of the reasons ZFS doesn't have a bp rewrite feature. 2) btrfs can have mixed RAID levels, and in that scenario calculating free space is tricky, but people still rely on common tools that simply give an estimate in that case. Change the way it estimates free space and you'll have fewer clueless people complaining about the fs running out of it, but more will say btrfs shows too little.
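
To illustrate why the second point is tricky, here is a minimal sketch of a naive free-space estimate. The function name and the numbers are hypothetical, and this is not how btrfs or `df` actually compute it; the only facts relied on are that single and raid0 store one copy of the data while dup and raid1 store two.

```python
# Raw bytes consumed per byte of user data for a few btrfs profiles.
RAW_PER_BYTE = {"single": 1, "raid0": 1, "dup": 2, "raid1": 2}

def estimate_free(unallocated_raw, data_profile):
    """Naive estimate: pretend all unallocated raw space will become data
    chunks of a single profile. It ignores future metadata chunks and
    mixed profiles, which is exactly why it is only an estimate."""
    return unallocated_raw // RAW_PER_BYTE[data_profile]

GiB = 2**30
raw = 100 * GiB  # hypothetical unallocated raw space
for profile in ("single", "raid1"):
    print(profile, estimate_free(raw, profile) // GiB, "GiB")
```

The same 100 GiB of raw space is reported as 100 GiB or 50 GiB depending on which profile the tool assumes, so any single number will look wrong to somebody.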
FWIW, Hammer on DragonflyBSD can rebalance and dedup with little memory and doesn't take long, but details matter and the comparison may not be fair. What's rebalance in ZFS might be something much more trivial and less effective in Hammer, but I've deduped Hammer filesystems on machines with little memory compared to what ZFS requires for its data structures in memory.
ZFS' data deduplication requires very little memory. However, it will check every new record write under it against every other record write. The only way to do this in a performant way is to lean on cache. Without sufficient cache, you degrade to performing 3 random sequential IOs, which performs terribly. The system will continue to run, but it will be slow.

As far as I know, there is no way to implement online deduplication with constant RAM usage without performing poorly as things scale or playing Schrödinger's cat with whether data that should deduplicate is subject to deduplication. Offline data deduplication might work, but it would be performance crippling ZFS' data integrity guarantees.

If HAMMER has online data deduplication that is performant with constant RAM, they likely made a sacrifice elsewhere to get it. My guess is that it misses cases, such that while you would expect unique records to be written once, they can be written multiple times.
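
A toy model of online deduplication shows where the pressure on cache comes from. This is a sketch under my own assumptions (an in-memory dict keyed by SHA-256), not ZFS's or HAMMER's actual implementation; the class and method names are made up.

```python
import hashlib

class DedupStore:
    """Toy online dedup: one table entry per unique record."""

    def __init__(self):
        # checksum -> [record bytes, reference count]
        self.table = {}

    def write(self, record: bytes) -> str:
        key = hashlib.sha256(record).hexdigest()
        entry = self.table.get(key)  # every write needs this lookup
        if entry is None:
            self.table[key] = [record, 1]  # new unique record: table grows
        else:
            entry[1] += 1  # duplicate: only the refcount changes
        return key

store = DedupStore()
first = store.write(b"a" * 4096)
store.write(b"b" * 4096)
again = store.write(b"a" * 4096)
print(len(store.table), store.table[first][1])
```

Every write does a lookup, and the table grows with the number of unique records. If the table does not fit in memory, each lookup turns into disk IO, which is the slow path described above; capping the table size instead means some duplicates are missed.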

I believe you're misunderstanding the problems that occur on a CoW filesystem. In fact btrfs already has an overcommit disk buffer, and is already doing many ENOSPC handling tricks.

Have a look; https://btrfs.wiki.kernel.org/index.php/ENOSPC

Reading more for my own interest, it seems ZFS uses the ZIL to help convert random writes to sequential, which helps with fragmentation under low space. I am curious if bcache can operate similarly. In addition, I should also point out that this is less of an issue on SSDs, due to the nature of how random reads/writes work there anyway (btrfs does a good job of being SSD aware).

The last I checked, ZFS had no open ENOSPC bugs. The trick that it uses is to reserve a small amount of space. I forget if it is 1.6% or 3.3%, but whichever that is, it combined with other tricks is considered to be enough.
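
For a sense of scale, here is what each of those two figures would set aside. The 10 TB pool size is a made-up example, and the percentages are just the two values recalled above, not something checked against the ZFS source.

```python
def reserved_bytes(pool_bytes, reserve_fraction):
    """Bytes held back from users for a given reserve fraction."""
    return round(pool_bytes * reserve_fraction)

POOL = 10 * 10**12  # hypothetical 10 TB pool
for fraction in (0.016, 0.033):
    gb = reserved_bytes(POOL, fraction) / 10**9
    print(f"{fraction:.1%} reserve -> {gb:.0f} GB")
```

Either way the reserve is far smaller than the 20% headroom discussed at the top of the thread, because it exists to avoid ENOSPC, not to keep performance up.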

ENOSPC is a very different condition from lower performance. If you tested filesystems at 90% full, you should find that all of them have lower performance than when they were empty. You might also find performance varies based on how you filled them.

You're right that the free space overhead is workload dependent. However, compression is orthogonal to the FS and for us, in the Hadoop world, we win nothing with efficient snapshots or other features. The problem we have is estimating how much overhead is 'safe', so we are inherently conservative. The lost disk capacity is a big deal on 1k+ Hadoop clusters.
I'm going to disagree with you re: compression. Compression at the FS layer brings a lot of benefits, and generally most workloads are IO constrained not CPU constrained. Compression at the FS improves IO at the expense of CPU.

Hadoop is a completely different workload, and maybe not something for ZFS or btrfs. Our Hadoop nodes are not RAID, just JBOD ext4 disks. We have been considering btrfs with the nodatacow mount option and lz4 compression, however. We haven't decided if it's better to compress within Hadoop or at the fs layer yet. I would be curious about your findings.

In Hadoop, people are mostly using formats like Parquet or ORC, and if not, compression libs like lzo or snappy. If you believe the Berkeley people (I don't, but the sheeple do), most Spark workloads are CPU bound, not IO bound. But irrespective of that, if most of your data is in a columnar data storage format, there's no gain (only cost) in having your FS also try to compress it. JBOD is considered best practice for Hadoop. That's why we're looking at RAID0 and RAID5 - we're researchers :) Actually, MapR recommend using 3 disks in RAID0 as volumes.
btrfs does not support lz4 compression. Unless you meant lzo, which performs terribly on incompressible data, you will want to use ZFS for lz4.

Also, nodatacow is a hack. If you take a snapshot on btrfs with nodatacow, it must use CoW on each thing in the snapshot that is overwritten. Until then, whatever horrible performance nodatacow prevented will manifest temporarily. ZFS is designed to make things asynchronous as much as possible (with the exception of partial record writes to uncached records, which needs to be fixed), so it lacks an equivalent to nodatacow and does not need it.

Yes, apologies, I did mean lzo. zlib seemed perhaps too much on the CPU side of the IO/CPU calculation. I am excited to see how btrfs snappy and lz4 support compares when they are added, however.

To the point of lzo performance though, btrfs is a tad smart with compression: it tries to compress an initial 128KiB, and if the compressed segment is not smaller than the uncompressed one, it adds the file to a list of no-compress files and will not try compression on that file again (unless of course you force it).
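
A minimal model of that heuristic, as I described it above, looks something like this. It is not btrfs code: zlib stands in for lzo because it ships with Python, and the class, method, and file names are made up.

```python
import os
import zlib

SAMPLE = 128 * 1024  # size of the initial segment that gets a trial compression

class Compressor:
    def __init__(self, force=False):
        self.no_compress = set()  # files that failed their first attempt
        self.force = force        # forcing skips the heuristic entirely

    def store(self, name, data):
        """Return (was_compressed, payload) for one file."""
        if not self.force:
            if name in self.no_compress:
                return False, data  # never retried once flagged
            sample = data[:SAMPLE]
            if len(zlib.compress(sample)) >= len(sample):
                self.no_compress.add(name)  # flag the file and give up
                return False, data
        return True, zlib.compress(data)

fs = Compressor()
print(fs.store("photo.jpg", os.urandom(SAMPLE))[0])   # random bytes do not shrink
print(fs.store("photo.jpg", b"a" * SAMPLE)[0])        # still flagged, not retried
print(fs.store("access.log", b"a" * SAMPLE)[0])       # compressible, so compressed
```

The cost of the heuristic is one wasted compression attempt per incompressible file, and the downside is the second case: a file flagged early stays uncompressed even if later writes would have compressed well.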

This was for our Hadoop use case, comparing to ext4, so nodatacow would work because we have no desire for snapshots in that environment. It still seems like we're better off compressing within the Hadoop framework (as jamesblonde is doing) and sticking with ext4 JBOD, for now at least.

Btrfs will likely never add support for lz4 or snappy:

https://btrfs.wiki.kernel.org/index.php/Compression#Are_ther...

There are links there to mailing list emails explaining the reasoning behind that. The reasoning behind ZFS adopting LZ4 can be found here:

http://wiki.illumos.org/display/illumos/LZ4+Compression

Contrary to what the btrfs developers claimed about LZ4 versus LZJB in ZFS, LZ4's compression performance on incompressible data alone would have been enough to adopt it had ZFS already had LZO support. LZ4 also has the benefit of extremely quick decompression speeds. It also has the peculiar property where running LZ4 repeatedly on low entropy files outperforms "superior" compression algorithms such as gzip. Someone on the LZ4 mailing list discovered this when compressing a 3.5GB log file. Running it twice yielded a 9.5MB file with regular LZ4 compression and a 2MB file with LZ4HC compression. He was able to compress it to ~750KB after running LZ4HC roughly 5 times.

https://groups.google.com/forum/#!topic/lz4c/DcN5SgFywwk
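
The multi-pass idea can be sketched as follows. zlib stands in for LZ4/LZ4HC here only because it ships with Python, so whatever sizes it prints say nothing about LZ4 itself; the function names and the sample log line are made up.

```python
import zlib

def compress_rounds(data, max_rounds=5):
    """Compress repeatedly, keeping a round only if it shrinks the data.
    Returns (payload, rounds_applied)."""
    rounds = 0
    while rounds < max_rounds:
        candidate = zlib.compress(data)
        if len(candidate) >= len(data):
            break  # this round did not help, so give up
        data, rounds = candidate, rounds + 1
    return data, rounds

def decompress_rounds(payload, rounds):
    """Undo compress_rounds by decompressing the same number of times."""
    for _ in range(rounds):
        payload = zlib.decompress(payload)
    return payload

log = b"GET /index.html 200\n" * 100_000  # hypothetical low-entropy log
packed, n = compress_rounds(log)
print(f"{len(log)} bytes -> {len(packed)} bytes after {n} round(s)")
```

The number of rounds has to be stored alongside the payload, since nothing in the compressed bytes says how many times to decompress.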

As for btrfs being smart with lzo by compressing only the first 128KB as a heuristic, LZ4 uses a hash table for that and is able to give up much faster. I would expect to see LZ4 significantly outperform LZO. The following site has numbers that appear to confirm that:

https://quixdb.github.io/squash-benchmark/

On a JPEG on their Intel® Core i7-2630QM, LZ4 level 1 compression runs at 608.24MB/sec while LZO level 3 compression runs at 68.13MB/sec. Also of possible interest is that Snappy compresses at 559MB/s here. There are a couple of caveats though. While I picked the correct variant of LZ4 for ZFS (and also the Linux kernel), I assumed that btrfs is using the default compression level on LZO like ZFS does on LZ4 and knew nothing about the different revisions measured, so I took the best reported number for any of them at the default compression level. That happened to be lzo1b. I also assumed that JPEGs are incompressible, which the data strongly supports. There are several exceptions, but LZO, Snappy and LZ4 all consider the file to be incompressible.

As for the question of whether to compress in Hadoop or in the filesystem, the properties of LZ4 mean that you can do both. If your data is incompressible, LZ4 will give up very quickly (both times). If your data is incompressible after 1 round of LZ4 compression, then the LZ4 compression in ZFS will give up quickly. If it is compressible by two rounds of LZ4, then both will run and you will use less storage space because of it.

I haven't heard this and I've been on the btrfs list for years. The biggest issue you're going to have at 80+% usage, and it applies to all file systems, is that transfer rates drop because inner tracks have fewer sectors; and, depending on the workload, seeks increase simply because there's more stuff on the disk to go looking for.

As for fragmentation, there are two kinds. One is fragmentation of files into more than one extent, fixed with 'btrfs filesystem defrag' and also autodefrag. The other is fragmentation of free space as a result of deleting files, and that's fixed with 'btrfs balance', which consolidates extents and writes them into new chunks, then frees up large regions of contiguous space on the drives. This is best used with filters.
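
As a small illustration of what "with filters" means in practice, the sketch below only builds the command lines and does not run anything. The helper names, the /mnt/data mount point, and the 50% threshold are made up; the `usage` filter restricts a balance to chunks that are less full than the given percentage, so mostly-empty chunks get consolidated without rewriting the whole filesystem.

```python
def balance_command(mountpoint, data_usage=None, metadata_usage=None):
    """Build a 'btrfs balance start' command with optional usage filters."""
    cmd = ["btrfs", "balance", "start"]
    if data_usage is not None:
        cmd.append(f"-dusage={data_usage}")      # filter on data chunks
    if metadata_usage is not None:
        cmd.append(f"-musage={metadata_usage}")  # filter on metadata chunks
    cmd.append(mountpoint)
    return cmd

def defrag_command(path, recursive=True):
    """Build a 'btrfs filesystem defragment' command for a file or directory."""
    cmd = ["btrfs", "filesystem", "defragment"]
    if recursive:
        cmd.append("-r")
    cmd.append(path)
    return cmd

print(" ".join(balance_command("/mnt/data", data_usage=50)))
print(" ".join(defrag_command("/mnt/data")))
```

Starting with a low usage threshold and raising it only if not enough space is freed keeps the balance short, since fewer chunks have to be rewritten.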