by nske 4276 days ago
I think that for many, if not most, use cases it would make more sense to have "off-line" deduplication (as, I believe, BTRFS does), since you can free up space on demand, at whatever point you judge it would yield the most benefit and the system is least busy.

I'm not sure how much benefit this approach provides over "real-time" deduplication, however, since the mapping table would still need to exist in memory, but I think write performance for non-duplicate data should improve.

PS. My use case is that I have a few dozen (linux-vserver) Gentoo containers that obviously share many files and, unfortunately, maintaining a shared read-only mount of the core system doesn't seem to be practical/viable (as it is, e.g., for FreeBSD jails, mostly due to the clear system separation). The waste is not significant enough to really bother me; it would just be nice to avoid. The solutions that I am aware of are (I'm not sure if I'm missing any, and I'd be happy to be pointed to something else if there is):

- integrated FS-level deduplication (ZoL, BTRFS)

- higher-level deduplication (lessfs and opendedup)

- hard-linking scripts (obviously at the file-level)
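For what it's worth, the hard-linking option can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions, not any particular existing tool: the function names (`file_digest`, `hardlink_duplicates`), the `dry_run` flag, and the `.dedup-tmp` suffix are all made up for the example. It groups regular files by device, size, mode, and owner (hard links share metadata, so files that differ there are left alone), confirms matches with SHA-256, and then swaps duplicates for links to the first copy found.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(roots, dry_run=True):
    """Find identical regular files under `roots` and hard-link them.

    Returns a list of (kept_path, replaced_path) pairs. With dry_run=True
    (the default) nothing on disk is changed.
    """
    # Cheap first pass: only files with equal device, size, mode, and
    # ownership can be merged, since a hard link shares all metadata.
    candidates = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode) or st.st_size == 0:
                    continue  # skip symlinks, devices, and empty files
                key = (st.st_dev, st.st_size, st.st_mode, st.st_uid, st.st_gid)
                candidates[key].append(path)

    linked = []
    for paths in candidates.values():
        if len(paths) < 2:
            continue
        first_seen = {}  # digest -> first path seen with that content
        for path in sorted(paths):
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already linked together
            if not dry_run:
                # Link under a temporary name, then rename over the
                # duplicate so the path never disappears in between.
                tmp = path + ".dedup-tmp"
                os.link(keeper, tmp)
                os.replace(tmp, path)
            linked.append((keeper, path))
    return linked
```

You would call it with the container roots (hypothetical paths, e.g. `hardlink_duplicates(["/vservers/a", "/vservers/b"])`), check the returned pairs, and only then rerun with `dry_run=False`. Keep in mind that hard links are not copy-on-write: an in-place write through one path changes the file for every container that links to it.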

1 comments

Cloned snapshots are a good way to deduplicate similar FS trees and have no RAM overhead. It's functionally the same as hardlinking.
Thanks, you mean LVM2/ZFS/BTRFS-based snapshots, right? I've seen that mentioned, but I was thinking that eventually, with system updates, the snapshots will end up having fewer and fewer blocks in common with their original source, so I would have to frequently recreate snapshots from a more similar source and then copy the unique data on top of them again to raise the hit rate, but that sounds like a pretty inconvenient thing to do.