by ChrisInEdmonton 4276 days ago
Ah ha! Thanks. Data deduplication. Well, yeah, that makes sense to me. Thank you very much for clearing that up.

No problem. It's really a shame to have such a high bar for using dedup. It either fits a given workload extremely well, or can be extremely detrimental.

There's been talk in the developer community about ways to address the usability of dedup, but so far nothing has gone further than small prototypes.

Setting zfs_dedup_prefetch=0 has been found to help systems using ZFS data deduplication. There is a patch in ZoL HEAD for the next release that makes zfs_dedup_prefetch=0 by default:

https://github.com/zfsonlinux/zfs/commit/0dfc732416922e1dd59...

Aside from that, you are right that there has not been much done here. Making data deduplication more performant is a difficult task and so far, no one has proposed improvements beyond trivial changes.
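
For anyone who wants to try the setting before it becomes the default, here is a minimal sketch. It assumes the module parameters are exposed under /sys/module/zfs/parameters (the usual place for ZoL, but check your system) and that writing there requires root. The helper names are made up for illustration.

```python
from pathlib import Path

# Assumed location of the ZFS-on-Linux module parameters; verify locally.
PARAM_DIR = Path("/sys/module/zfs/parameters")


def read_param(name, param_dir=PARAM_DIR):
    """Return the current value of a module parameter as a stripped string."""
    return (Path(param_dir) / name).read_text().strip()


def write_param(name, value, param_dir=PARAM_DIR):
    """Write a new value to a module parameter (needs root on a real system)."""
    (Path(param_dir) / name).write_text(f"{value}\n")


def modprobe_line(name, value, module="zfs"):
    """Build the line for a modprobe.d config file so the setting persists."""
    return f"options {module} {name}={value}"


if __name__ == "__main__":
    # Only reads the current value; call write_param(...) yourself as root.
    print(read_param("zfs_dedup_prefetch"))
    print(modprobe_line("zfs_dedup_prefetch", 0))
```

The runtime write only lasts until the module is reloaded; the modprobe_line output is what you would put in a file such as /etc/modprobe.d/zfs.conf to keep it across reboots.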

I think for many or most use cases, it would make more sense to have "off-line" deduplication (like, I believe, BTRFS does), as you can free up space on demand, when you judge it would yield the most benefit and the system is least busy.

I'm not sure how much benefit this approach can provide compared to "real time" deduplication, however, as the mapping table would still need to exist in memory, but I think write performance for non-duplicate data should improve.
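
To put a rough number on the in-memory mapping table, here is a back-of-the-envelope sketch. It assumes about 320 bytes per table entry, which is a commonly quoted rule of thumb for ZFS and not a figure from this thread, plus one entry per unique block and a fixed average block size. Treat the result as an order-of-magnitude guess.

```python
def ddt_memory_bytes(unique_data_bytes,
                     avg_block_bytes=128 * 1024,
                     bytes_per_entry=320):
    """Estimate the RAM needed to hold one table entry per unique block.

    bytes_per_entry is an assumed rule-of-thumb value, not a measured one.
    """
    # Round up: a partially filled block still needs its own entry.
    unique_blocks = -(-unique_data_bytes // avg_block_bytes)
    return unique_blocks * bytes_per_entry


if __name__ == "__main__":
    tib = 2 ** 40
    for block in (128 * 1024, 8 * 1024):
        gib = ddt_memory_bytes(tib, block) / 2 ** 30
        print(f"1 TiB unique data, {block // 1024} KiB blocks: ~{gib:.1f} GiB")
```

Under these assumptions the estimate scales inversely with block size, so small-block workloads are the ones where the table gets painful.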

PS. My use case is that I have a few dozen (linux-vserver) Gentoo containers that obviously share many files and, unfortunately, maintaining a shared read-only mount of the core system doesn't seem to be practical (as it is, e.g., for FreeBSD jails, mostly thanks to their clear system separation). The waste is not significant enough to really bother me; it would just be nice to avoid. The solutions I am aware of are (I'm not sure whether I'm missing any, and I'd be happy to be pointed to something else if there is):

- integrated FS-level deduplication (ZoL, BTRFS)

- higher-level deduplication (lessfs and opendedup)

- hard-linking scripts (obviously at the file-level)
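
For the hard-linking option, a minimal sketch of what such a script could look like. This is not any particular existing tool; the function names are made up, and it assumes all trees live on the same filesystem. It only merges regular, non-empty files that match on size, mode, owner and SHA-256, and it defaults to a dry run.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(roots, dry_run=True):
    """Replace identical regular files under `roots` with hard links.

    Returns a list of (kept, replaced) path pairs. With dry_run=True
    nothing on disk is changed.
    """
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                # Skip symlinks, devices and empty files.
                if not stat.S_ISREG(st.st_mode) or st.st_size == 0:
                    continue
                # Hard links share metadata, so only consider files that
                # already agree on device, size, mode and ownership.
                key = (st.st_dev, st.st_size, st.st_mode, st.st_uid, st.st_gid)
                groups[key].append((path, st.st_ino))

    pairs = []
    for candidates in groups.values():
        if len(candidates) < 2:
            continue
        kept_by_digest = {}  # digest -> (path, inode) of the copy we keep
        for path, ino in sorted(candidates):
            digest = file_digest(path)
            if digest not in kept_by_digest:
                kept_by_digest[digest] = (path, ino)
                continue
            kept, kept_ino = kept_by_digest[digest]
            if ino == kept_ino:
                continue  # already the same inode
            pairs.append((kept, path))
            if not dry_run:
                # Link under a temporary name, then rename over the
                # duplicate so the path never disappears.
                tmp = path + ".dedup-tmp"
                os.link(kept, tmp)
                os.replace(tmp, path)
    return pairs
```

Calling hardlink_duplicates(["/vservers/a", "/vservers/b"]) (hypothetical paths) lists what would be merged; pass dry_run=False to actually link. The usual caveat applies: an in-place write through one link changes the file for every container sharing it.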

Cloned snapshots are a good way to deduplicate similar FS trees and have no RAM overhead. It's functionally the same as hardlinking.

Thanks, you mean LVM2/ZFS/BTRFS-based snapshots, right? I've seen that mentioned, but I was thinking that eventually, with system updates, the snapshots will end up sharing fewer and fewer blocks with their original source, so I would have to frequently recreate snapshots from a more similar source and then copy the unique data on top of them again to raise the hit rate, but that sounds like a pretty inconvenient thing to do.