by ajross 295 days ago
Meh. This war was stale like nine years ago. At this point the originally-beaten horse has decomposed into soil. My general reply to this is:

1. The dm layer gives you cow/snapshots for any filesystem you want already and has for more than a decade. Some implementations actually use it for clever trickery like updates, even. Anyone who has software requirements in this space (as distinct from "wants to yell on the internet about it") is very well served.

2. Compression seems silly in the modern world. Virtually everything is already compressed. To a first approximation, every byte in persistent storage anywhere in the world is in a lossy media format. And the ones that aren't are in some other cooked format. The only workloads where you see significant use of losslessly-compressible data are in situations (databases) where you have app-managed storage performance (and which see little value from filesystem choice) or ones (software building, data science, ML training) where there's lots of ephemeral intermediate files being produced. And again those are usages where fancy filesystems are poorly deployed: you're going to throw it all away within hours to days anyway.

Filesystems are a solved problem. If ZFS disappeared from the world today... really who would even care? Only those of us still around trying to shout on the internet.


For me bcachefs provides a feature no other filesystem on Linux has: automated tiered storage. I've wanted this ever since I got an SSD more than 10 years ago, but filesystems move slow.

A block level cache like bcache (not fs) and dm-cache handles it less ideally, and doesn't leave the SSD space as usable space. As a home user, 2TB of SSDs is 2TB of space I'd rather have. ZFS's ZIL is similar, not leaving it as usable space. Btrfs has some recent work in differentiating drives to store metadata on the faster drives (allocator hints), but that only does metadata as there is no handling of moving data to HDDs over time. Even Microsoft's ReFS does tiered storage I believe.

I just want to have 1 or 2 SSDs, with 1 or 2 HDDs in a single filesystem that gets the advantages of SSDs with recently used files and new writes, and moves all the LRU files to the HDDs. And probably keep all the metadata on the SSDs too.
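
For what it's worth, the policy described above can be sketched in a few lines. This is only a toy model under stated assumptions, not how bcachefs (or anything else) implements tiering: `TieredStore`, `ssd_capacity`, and `tier_of` are hypothetical names, capacity is counted in files rather than bytes, and metadata placement is ignored.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model: a small fast tier (SSD) in front of a large slow tier (HDD).

    Every file lives on exactly one tier, so the SSD space stays usable
    capacity instead of being a cache copy. Hypothetical sketch only.
    """

    def __init__(self, ssd_capacity):
        self.ssd_capacity = ssd_capacity  # max number of files on the fast tier
        self.ssd = OrderedDict()          # name -> data, oldest use first
        self.hdd = {}                     # name -> data

    def _evict(self):
        # Demote least recently used files until the fast tier fits again.
        while len(self.ssd) > self.ssd_capacity:
            name, data = self.ssd.popitem(last=False)
            self.hdd[name] = data

    def write(self, name, data):
        # New writes always land on the fast tier.
        self.hdd.pop(name, None)
        self.ssd[name] = data
        self.ssd.move_to_end(name)
        self._evict()

    def read(self, name):
        if name in self.ssd:
            self.ssd.move_to_end(name)    # mark as recently used
            return self.ssd[name]
        data = self.hdd.pop(name)         # raises KeyError if the file is unknown
        self.ssd[name] = data             # promote on read
        self._evict()
        return data

    def tier_of(self, name):
        if name in self.ssd:
            return "ssd"
        if name in self.hdd:
            return "hdd"
        return None
```

With `ssd_capacity=2`, writing files `a`, `b`, `c` in that order leaves `a` on the HDD tier, and reading `a` back promotes it and demotes `b`.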

> automated tiered storage. I've wanted this ever since I got an SSD more than 10 years ago, but filesystems move slow.

You were not alone. However, things changed: SSDs continued to become cheaper and grew in capacity. I'd think most active data these days is on SSDs (certainly in most desktops, most servers which aren't explicit file or DB servers, and all mobile and embedded devices), the role of spinning rust being more and more archiving (if found in a system at all).

Tiering didn't go away with the migration to all-SSD storage. It just got somewhat hidden. All consumer SSDs are doing tiered storage within the drive, using drive-specific heuristics that are completely undocumented, and host software rarely if ever makes use of features that exist to provide hints to the SSD to allow its tiering/caching to be more intelligent. In the server space, most SSDs aren't doing this kind of caching, but it's definitely not unheard-of.
Yeah, for enterprise where you can have dedicated machines for single use (and $) there probably isn't much appeal. That's why I emphasized as a home user, where all my machines are running various applications.

Also for video games, where performance matters, game sizes are huge, and it's nice to have a bunch of games installed.

Until SSD $/GB drops to something comparable to HDDs, large-scale storage will continue to use HDDs.
> Compression seems silly in the modern world. Virtually everything is already compressed.

IIRC my laptop's zpool has a 1.2x compression ratio; it's worth doing. At a previous job, we had over a petabyte of postgres on ZFS and saved real money with compression. Hilariously, on some servers we also improved performance because ZFS could decompress reads faster than the disk could read.
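
A rough way to see why the ratio depends so heavily on the data is to compare text-like and random-like input. This is a minimal sketch, assuming Python's standard `zlib` as a stand-in (ZFS itself would typically use something like lz4 or zstd, so absolute numbers will differ); `compression_ratio`, `text_like`, and `random_like` are made-up names for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original_size / compressed_size for one lossless zlib pass."""
    if not data:
        return 1.0
    return len(data) / len(zlib.compress(data, level))


# Repetitive, text-like data, loosely resembling database rows or build logs.
text_like = b"INSERT INTO events (id, kind) VALUES (1, 'click');\n" * 2000

# Random bytes stand in for already-compressed media: nothing left to squeeze.
random_like = os.urandom(100_000)

print(f"text-like:   {compression_ratio(text_like):.1f}x")
print(f"random-like: {compression_ratio(random_like):.2f}x")
```

The text-like buffer shrinks dramatically while the random one stays at roughly 1x, which is the split both sides of this argument are pointing at.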

> we also improved performance because ZFS could decompress reads faster than the disk could read

This is my favorite side effect of compression in the right scenarios. I remember getting a huge speed up in a proprietary in-memory data structure by using LZO (or one of those fast algorithms) which outperformed memcpy, and this was already in memory so no disk io involved! And used less than a third of the memory.

The performance gain from compression (replacing IO with compute) is not ironic; it was seen as a feature for the various NAS products that Sun (and after them Oracle) developed around ZFS.
How do you get a PostgreSQL database to grow to one petabyte? The maximum table size is 32 TB o_O
Cumulative; dozens of machines with a combined database size over a PB even though each box only had like 20 TB.
Probably by using partitioning.
I know my own personal anecdote isn’t much, but I’ve noticed pretty good space savings on the order of like 100 GB from zstd compression and CoW on my personal disks with btrfs

As for the snapshots, things like LVM snapshots are pretty coarse, especially for someone like me where I run dm-crypt on top of LVM

I’d say zfs would be pretty well missed with its data integrity features. I’ve heard that btrfs is worse in that aspect, so given that btrfs saved my bacon with a dying ssd, I can only imagine what zfs does.

> Filesystems are a solved problem. If ZFS disappeared from the world today... really who would even care? Only those of us still around trying to shout on the internet.

Yeah nah, have you tried processing terabytes of data every day and storing them? It gets better now with DDR5 but bit flips do actually happen.

Bit flips can happen, and if it’s a problem you should have additional verification above the filesystem layer, even if using ZFS.

And maybe below it.

And backups.

Backups make a lot of this minor.

Backups are great, but don't help much if you back up corrupted data.

You can certainly add verification above and below your filesystem, but the filesystem seems like a good layer to have verification. Capturing a checksum while writing and verifying it while reading seems appropriate; zfs scrub is a convenient way to check everything on a regular basis. Personally, my data feels important enough to make that level of effort, but not important enough to do anything else.
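
The checksum-on-write, verify-on-read idea plus a periodic scrub can be sketched like this. It is a toy in-memory model under assumptions of my own, not ZFS's on-disk design: `ChecksummedStore`, `scrub`, and `corrupt` are hypothetical names, SHA-256 is an arbitrary choice of checksum, and there is no redundancy, so it can only detect corruption, not repair it.

```python
import hashlib


class ChecksummedStore:
    """Toy block store that records a checksum on write and checks it on read."""

    def __init__(self):
        self.blocks = {}  # block_id -> (data, digest)

    def write(self, block_id, data: bytes):
        # Capture the checksum at write time, alongside the data.
        self.blocks[block_id] = (bytes(data), hashlib.sha256(data).digest())

    def read(self, block_id) -> bytes:
        data, digest = self.blocks[block_id]
        # Verify on every read so silent corruption becomes a loud error.
        if hashlib.sha256(data).digest() != digest:
            raise OSError(f"checksum mismatch in block {block_id}")
        return data

    def scrub(self):
        """Read every block and return the ids that fail verification."""
        bad = []
        for block_id in self.blocks:
            try:
                self.read(block_id)
            except OSError:
                bad.append(block_id)
        return bad

    def corrupt(self, block_id, offset=0):
        """Test helper: flip one bit in the stored data, leaving the digest alone."""
        data, digest = self.blocks[block_id]
        flipped = bytearray(data)
        flipped[offset] ^= 0x01
        self.blocks[block_id] = (bytes(flipped), digest)
```

A regular scrub matters because a read-time check only covers blocks you happen to read; cold data can rot unnoticed until something walks all of it.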

FWIW, framed the way you do, I'd say the block device layer would be an *even better* place for that validation, no?

> Personally, my data feels important enough to make that level of effort, but not important enough to do anything else.

OMG. Backups! You need backups! Worry about polishing your geek cred once your data is on physically separate storage. Seriously, this is not a technology choice problem. Go to Amazon and buy an exfat stick, whatever. By far the most important thing you're ever going to do for your data is Back. It. Up.

Filesystem choice is, and I repeat, very much a yell-on-the-internet kind of thing. It makes you feel smart on HN. Backups to junky Chinese flash sticks are what are going to save you from losing data.

I appreciate the argument. I do have backups. ZFS makes it easy to send snapshots and so I do.

But I don't usually verify the backups, so there's that. And everything is in the same zip code for the most part, so one big disaster and I'll lose everything. C'est la vie.

What good is a backup if you can't restore it?
Ok I think you're making a well-considered and interesting argument about devicemapper vs. feature-ful filesystems but you're also kind of personalizing this a bit. I want to read more technical stuff on this thread and less about geek cred and yelling. :)

I wouldn't comment but I feel like I'm naturally on your side of the argument and want to see it articulated well.

I didn't really think it was that bad? But sure, point taken.

My goal was actually the same though: to try to short-circuit the inevitable platform flame by calling it out explicitly and pointing out that the technical details are sort of a solved problem.

ZFS argumentation gets exhausting, and has ever since it was released. It ends up as a proxy for Sun vs. Linux, GNU vs. BSD, Apple vs. Google, hippy free software vs. corporate open source, pick your side. Everyone has an opinion, everyone thinks it's crucially important, and as a result of that hyperbole everyone ends up thinking that ZFS (dtrace gets a lot of the same treatment) is some kind of magically irreplaceable technology.

And... it's really not. Like I said above if it disappeared from the universe and everyone had to use dm/lvm for the actual problems they need to solve with storage management[1], no one would really care.

[1] Itself an increasingly vanishing problem area! I mean, at scale and at the performance limit, virtually everything lives behind a cloud-adjacent API barrier these days, and the backends there worry much more about driver and hardware complexity than they do about mere "filesystems". Dithering about individual files on individual systems in the professional world is mostly limited to optimizing boot and update time on client OSes. And outside the professional world it's a bunch of us nerds trying to optimize our movie collections on local networks; realistically we could be doing that on something as awful as NTFS if we had to.

And once more, you're positing the lack of a feature that is available and very robust (c.f. "yell on the internet" vs. "discuss solutions to a problem"). You don't need your filesystem to integrate checksumming when dm/lvm already do it for you.
> You don't need your filesystem to integrate checksumming when dm/lvm already do it for you.

https://wiki.archlinux.org/title/Dm-integrity

> It uses journaling for guaranteeing write atomicity by default, which effectively halves the write speed

I'd really rather not do that, thanks.

So... there's a reason you had to cite a throwaway comment on a distro wiki and not documentation. Needless to say journaling metadata (something done in some form by every filesystem you will ever use!) does not, in fact, "halve the write speed".
> So... there's a reason you had to cite a throwaway comment on a distro wiki and not documentation.

No, I read the official kernel docs too; the Arch wiki just happened to be a quicker way to describe it.

From https://docs.kernel.org/admin-guide/device-mapper/dm-integri... -

> The dm-integrity target can also be used as a standalone target, in this mode it calculates and verifies the integrity tag internally. In this mode, the dm-integrity target can be used to detect silent data corruption on the disk or in the I/O path.

> There’s an alternate mode of operation where dm-integrity uses a bitmap instead of a journal. If a bit in the bitmap is 1, the corresponding region’s data and integrity tags are not synchronized - if the machine crashes, the unsynchronized regions will be recalculated. The bitmap mode is faster than the journal mode, because we don’t have to write the data twice, but it is also less reliable, because if data corruption happens when the machine crashes, it may not be detected.

This is more clearly presented lower down in the list of modes, in which most options describe how they don't actually protect against crashes, except for journal mode:

> J - journaled writes

> data and integrity tags are written to the journal and atomicity is guaranteed. In case of crash, either both data and tag or none of them are written. The journaled mode degrades write throughput twice because the data have to be written twice.

On further reflection, I grant that that might only be talking about the integrity metadata, in which case we just don't know about the impact to data writes and it would be useful to go benchmark to see what the hit is in practice.

EDIT: So I went looking to see if anyone had done that benchmarking and found https://github.com/t13a/dm-integrity-benchmarks which seems to show that actually yes dm-integrity is that bad on data writes. Of course, its possible saving grace is that everything else with the same features also had a performance hit. I also found https://www.reddit.com/r/linuxadmin/comments/1crtggd/why_dmi... talking about it.
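
Taking the quoted kernel text literally (data and tags both go through the journal), the back-of-envelope arithmetic looks like this. It is only a sketch under assumptions: `SECTOR`, `TAG_SIZE`, and `device_bytes_written` are hypothetical, the tag size is made up, and journal metadata, batching, and caching are all ignored, so real benchmarks are still the thing to trust.

```python
SECTOR = 512    # assumed sector size in bytes
TAG_SIZE = 4    # assumed integrity tag bytes per sector (made-up value)


def device_bytes_written(payload_sectors: int, mode: str) -> int:
    """Estimate bytes hitting the device for a given amount of payload."""
    data = payload_sectors * SECTOR
    tags = payload_sectors * TAG_SIZE
    if mode == "journal":
        # Data and tags are written to the journal, then to their final location.
        return 2 * (data + tags)
    if mode == "direct":
        # Data and tags are written once, with no crash atomicity between them.
        return data + tags
    raise ValueError(f"unknown mode: {mode}")


for mode in ("direct", "journal"):
    written = device_bytes_written(1_000_000, mode)
    print(f"{mode:8s} {written / 1e6:.0f} MB written")
```

Under those assumptions the journaled mode writes exactly twice as many bytes, which matches the "written twice" wording but says nothing about how much of that shows up as lost throughput on a given device.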

FWIW, the github link you show clearly shows the ext4-on-dm stack to be FASTER than ZFS!

It only falls behind, and very significantly so, on the 1M sequential write test, exactly the situation where you'd expect there to be the least delta between systems! I'm going to bet anything that's a misconfigured RAID.

Frankly looking at that from a "will this work best for my general purpose filesystem used mostly to handle giant software builds and Zephyr test suites" it seems like a no brainer to pick dm, especially so given the simplicity argument.

i'm not one for internet arguments and really just want solutions. maybe you could point me at the details for a setup that worked for you?

based on my own testing, dm has a lot of footguns and, with some kernels, as little as 100 bytes of corruption to the underlying disk could render a dm-integrity volume completely unusable (requiring a full rebuild) https://github.com/khimaros/raid-explorations

Well, the intention of the integrity targets is to preserve integrity, and using them is an explicit choice, in particular for encrypted data. You definitely need a backup strategy.
One feature I like about ZFS and have not seen elsewhere is that you can have each filesystem within the pool use its own encryption keys but more importantly all of the pool's data integrity and maintenance protection (scrubs, migrations, etc) work with filesystems in their encrypted state. So you can boot up the full system and then unlock and access projects only as needed.

The dm stuff is one key for the entire partition and you can't check it for bitrot or repair it without the key.

> And the ones that aren't are in some other cooked format.

Maybe, if you never create anything. I make a lot of game art source and much of that is in uncompressed formats. Like blend files, obj files, even DDS can compress, depending on the format and data, due to the mip maps inside them. Without FS compression it would be using GBs more space.

I'm not going to individually go through and micromanage file compression even with a tool. What a waste of time, let the FS do it.

> The dm layer gives you cow/snapshots for any filesystem you want already and has for more than a decade. Some implementations actually use it for clever trickery like updates, even.

O_o

Apparently I've been living under a rock, can you please show us a link about this? I was just recently (casually) looking into bolting on ZFS/BTRFS-like partial snapshot features to simulate my own atomic distro where I am able to freely roll back if an update goes bad. Think Linux's Timeshift with a little something extra.

There are downsides to adding features in layers, as opposed to integrating them with the FS, but dm can do quite a lot:

https://docs.kernel.org/admin-guide/device-mapper/snapshot.h...

DM has targets that facilitate block-level snapshots, lazy cloning of filesystems, compression, &c. Most people interact with those features through LVM2. COW snapshots are basically the marquee feature of LVM2.
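
To make "block-level COW snapshot" concrete, here is a toy model of copy-on-first-write. It is a minimal sketch loosely in the spirit of what the snapshot doc linked above describes, not the actual dm-snapshot on-disk format or the LVM2 interface; `CowVolume` and its methods are hypothetical names.

```python
class CowVolume:
    """Toy block device with copy-on-first-write snapshots."""

    def __init__(self, num_blocks, block_size=4):
        self.blocks = [bytes(block_size)] * num_blocks  # zero-filled origin
        self.snapshots = []  # each snapshot: dict of block index -> old data

    def snapshot(self):
        # Taking a snapshot copies nothing up front; it starts out empty.
        snap = {}
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Before overwriting, save the old block into every snapshot that
        # has not already preserved its own copy of this block.
        for snap in self.snapshots:
            snap.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read(self, index, snap=None):
        # A snapshot read uses the preserved copy if there is one and
        # otherwise falls through to the (unchanged) origin block.
        if snap is not None and index in snap:
            return snap[index]
        return self.blocks[index]


vol = CowVolume(num_blocks=4)
vol.write(0, b"aaaa")
before_update = vol.snapshot()
vol.write(0, b"bbbb")  # e.g. an update that turns out to be bad
print(vol.read(0), vol.read(0, before_update))
```

Rolling back then amounts to reading (or copying back) the snapshot's view. Note that in this model every write to the origin pays an extra copy per live snapshot, which is one of the downsides of layering mentioned above.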
The other thing dm/lvm gives you is dogshit performance