Hacker News
by ChrisInEdmonton 4276 days ago
I have heard that ZFS (at least, ZFS on Linux) requires about 1 GB of memory for every 1 TB of storage. Is this an accurate statement? That's certainly a critical flaw for many use cases, though completely irrelevant for many others. If accurate, what's behind this requirement?
4 comments

On ZFS, all file data is stored in B-trees whose leaves hold X bytes each, where X is the value of recordsize at file creation. When a write is done to a file in a dataset with dedup=on, a lookup is done on the data deduplication table (DDT). If the lookup finds an entry, it increments a reference. If it does not find an entry, it creates one. This involves 3 random seeks, and consequently your total write throughput is [average record size * IOPS / 3 random seeks]. If you have sufficient memory that the entire DDT can stay cached, then you avoid this limit.
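As a rough illustration of the throughput limit above, here is a minimal sketch of the arithmetic. The function name, the 100 IOPS figure and the 128 KiB record size are assumptions chosen for the example, not measurements of any real pool.

```python
def dedup_write_throughput(avg_record_size_bytes, iops, seeks_per_write=3):
    """Estimate write throughput in bytes/s when DDT lookups are not cached.

    Implements [average record size * IOPS / 3 random seeks] from the comment.
    """
    return avg_record_size_bytes * iops / seeks_per_write


# Hypothetical disk: 100 random IOPS, 128 KiB average record size.
limit = dedup_write_throughput(128 * 1024, 100)
print(f"{limit / 2**20:.2f} MiB/s")
```

With those assumed numbers the estimate is only a few MiB/s, which is why keeping the DDT cached matters so much.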

The 1GB of RAM for every 1TB of storage is a rule of thumb for avoiding this limit that someone made a very long time ago. Unfortunately, memory requirements cannot be estimated by such a simple rule, yet it has stuck with us for years despite being wrong.

The amount of memory needed to store a DDT is a function of the average record size. If your dataset has the default recordsize=128K (e.g. you are storing many >1MB files), then you can multiply your system memory by 153.6 to determine the total amount of unique data that you can store before hitting the limit I described. If you are storing many small <128KB files, then you need to calculate the average file size and then calculate [average file size * system memory * 12 / 10355] to obtain the total amount of unique data. This assumes the module settings for zfs_arc_max and zfs_arc_meta_max were left at their ZoL defaults. It is important to keep in mind that unique data is different from total data, especially in the case of duplicate files. The unique data figure also excludes the metadata required for maintaining indirect block trees, dnodes (i.e. inodes), directories and the DDT itself.

The total amount of data that you can store on a pool before hitting the limit is [unique data * deduplication multiplier], where unique data is calculated as I described and the deduplication multiplier is a number that is 1 or greater. A measure of the deduplication multiplier is provided by `zpool list` as DEDUP, so you should be able to see that yourself. If your pool has data that was written without dedup=on, then any duplicates in that data will be counted as unique data for the purposes of that calculation. To provide a simple example of the deduplication multiplier, imagine a pool with only two files that store the same data. The deduplication multiplier for that pool would be 2, provided that both were written with dedup=on set on their dataset. If one or both were written with dedup=off, then the deduplication multiplier would be 1. You can use zdb to calculate the theoretical deduplication statistics for an entire pool by running `zdb -D $POOLNAME`. Note that this will require significant memory because it constructs a full DDT in userland memory.
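The two-file example can be reproduced with a toy model. This is only an illustration of how the multiplier behaves, not how ZFS computes DEDUP; the function name and the representation of files as lists of blocks are made up for the sketch.

```python
def dedup_multiplier(files):
    """Toy deduplication multiplier: total blocks / blocks counted as unique.

    files is a list of (blocks, dedup_on) pairs, where blocks is a list of
    bytes objects. Blocks written with dedup off always count as unique.
    """
    seen = set()
    total = 0
    unique = 0
    for blocks, dedup_on in files:
        for block in blocks:
            total += 1
            if dedup_on and block in seen:
                continue  # duplicate of a deduplicated block: reference only
            if dedup_on:
                seen.add(block)
            unique += 1
    return total / unique if unique else 1.0


data = [b"same contents"]
print(dedup_multiplier([(data, True), (data, True)]))   # both dedup=on
print(dedup_multiplier([(data, True), (data, False)]))  # one dedup=off
```

Total capacity before the limit is then the unique data estimate times this multiplier.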

No, that's flat out not true.

I've seen that metric thrown around when talking about the "dedup" feature of ZFS, but honestly, don't use dedup unless you know what you're doing. It's way too easy for things to go wrong otherwise.

How much memory does ZFS (and/or ZoL) need per 1TB of storage when dedup is off?

Also, bup ("it backs things up!") efficiently dedups across an ssh connection (using bloom filters). Scales are different, but it might work for ZFS as well.

I answered this question here:

https://news.ycombinator.com/item?id=8437921

The recommendations I've read say you need 1GB RAM for system use (assuming a dedicated file server), and then as much RAM (ideally ECC RAM) as you want to give it for caching data.

If you're short on RAM (sub 4GB), you might need to change some of the default settings to avoid problems, but RAM's fairly cheap nowadays, so unless you've got an old machine, it's not likely to be a problem :)

I'd point out that cheap ram is about $10/GB right now, and a cheap HDD is $30/TB. So if you need an extra GB of ram for each TB of storage, you're increasing storage costs by a third.
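The cost arithmetic above, as a small sketch. The prices are the ones quoted in the comment; the function name and the ram_gb_per_tb parameter are just for illustration.

```python
def storage_cost_increase(ram_price_per_gb, hdd_price_per_tb, ram_gb_per_tb=1.0):
    """Fractional increase in per-TB storage cost from the extra RAM."""
    return ram_price_per_gb * ram_gb_per_tb / hdd_price_per_tb


# $10/GB RAM and $30/TB HDD, with 1 GB of RAM per TB of storage.
print(f"{storage_cost_increase(10, 30):.0%}")
```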
Ah ha! Thanks. Data deduplication. Well, yeah, that makes sense to me. Thank you very much for clearing that up.
No problem. It's really a shame to have such a high bar for using dedup. It either fits a given workload extremely well, or can be extremely detrimental.

There's been talk in the developer community about ways to address the usability of dedup, but so far nothing has gone further than small prototypes.

Setting zfs_dedup_prefetch=0 has been found to help systems using ZFS data deduplication. There is a patch in ZoL HEAD for the next release that makes zfs_dedup_prefetch=0 by default:

https://github.com/zfsonlinux/zfs/commit/0dfc732416922e1dd59...

Aside from that, you are right that there has not been much done here. Making data deduplication more performant is a difficult task and so far, no one has proposed improvements beyond trivial changes.

I think for many or most use cases, it would make more sense to have "off-line" deduplication (like, I believe, BTRFS does), as you can free up space on demand, when you judge it would yield the most benefit and the system is the least busy.

I'm not sure how much benefit this approach can provide compared to "real time" deduplication, however, as the mapping table would still need to exist in memory, but I think there should be an increase in the write performance of non-duplicate data.

PS. My use case is that I have a few dozen (linux-vserver) gentoo containers that obviously share many files and, unfortunately, trying to maintain a shared read-only mount of the core system doesn't seem to be practical/viable (as it is e.g. for FreeBSD jails, mostly due to the clear system separation). The waste is not significant enough to really bother me, it would just be nice to avoid. The solutions that I am aware of are (not sure if I'm missing any, I'd be happy to be pointed to something else, if there is):

- integrated FS-level deduplication (ZoL, BTRFS)

- higher-level deduplication (lessfs and opendedup)

- hard-linking scripts (obviously at the file-level)

Cloned snapshots are a good way to deduplicate similar FS trees and have no RAM overhead. It's functionally the same as hardlinking.
Thanks, you mean LVM2/ZFS/BTRFS based snapshots, right? I've seen that mentioned, but I was thinking that eventually, with system updates, the snapshots will end up having fewer and fewer blocks in common with their original source, so I would have to frequently recreate new snapshots from a more similar source and then copy the unique data on top of them again to raise the hit rate, but that sounds like a pretty inconvenient thing to do.
another anecdote: I've only heard this mentioned specifically about using de-duplication.
All the extra stuff ZFS does, and does performantly, comes at a cost. That's the cost (I can't vouch for the amount.. but the principle is right. CPU power matters too.)

You don't get all those features for free... but you do get them.

While this is true, ZFSOnLinux has made progress on reducing CPU overhead. Additional improvements will be in the next release. In particular, the performance of NFS exports of ZFS datasets will increase considerably (some are seeing a factor of 2 increase) while CPU utilization has dropped:

https://github.com/zfsonlinux/spl/pull/369#issuecomment-5839...