by rzzzt 486 days ago
I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:

- compute SHA256 hashes for each file on the source side

- copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check whether I already have a copy of the same file)

- mirror the source directory structure to the destination

- create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.
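
For illustration, the four steps above can be sketched in Python roughly as follows. This is not the original bash script; the `canonical` and `snapshots` folder names and the `snapshot` function are made up for the example, and it assumes the canonical store and the snapshots live on the same filesystem (hard links cannot cross filesystems).

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(source: Path, dest_root: Path, name: str) -> Path:
    """Create one browseable snapshot of `source` under `dest_root`.

    Hypothetical layout: dest_root/canonical/<sha256> holds one copy per
    distinct content, dest_root/snapshots/<name>/ mirrors the source tree.
    """
    store = dest_root / "canonical"
    snap = dest_root / "snapshots" / name
    store.mkdir(parents=True, exist_ok=True)
    snap.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a snapshot

    for dirpath, _dirnames, filenames in os.walk(source):
        rel = Path(dirpath).relative_to(source)
        (snap / rel).mkdir(parents=True, exist_ok=True)  # mirror the directory
        for filename in filenames:
            src = Path(dirpath) / filename
            if src.is_symlink() or not src.is_file():
                continue  # this sketch only handles regular files
            canonical = store / sha256_of(src)
            if not canonical.exists():
                # Copy to a temporary name first so an interrupted copy
                # never leaves a truncated file under a valid hash name.
                tmp = canonical.with_suffix(".tmp")
                shutil.copy2(src, tmp)
                os.replace(tmp, canonical)
            # Original file name, pointing at the canonical copy.
            os.link(canonical, snap / rel / filename)
    return snap
```

Calling `snapshot(Path("/data"), Path("/backup"), "2024-01-01")` twice with different names would store each distinct file content once and give every snapshot its own full directory tree. The sketch ignores files that change between hashing and copying, and it keeps only whatever metadata the first copy of a given content happened to have.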

Then I got too scared to actually use it :)

1 comments

Hard links are not a suitable alternative here. When you deduplicate files, you typically want copy-on-write: if an app writes to one file, it should not change the other. Because of this, I would be extremely scared to use anything based on hard links.
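
A small Python sketch of the aliasing problem, with made-up file names: an in-place write through one hard link shows up under the other name, while a tool that writes a new file and renames it over the old one only changes its own name.

```python
import os
from pathlib import Path
from typing import Tuple


def demonstrate_aliasing(workdir: Path) -> Tuple[bytes, bytes]:
    """Return what `original.txt` contains after two kinds of write to its hard link."""
    original = workdir / "original.txt"
    alias = workdir / "alias.txt"
    original.write_bytes(b"version 1")
    os.link(original, alias)  # both names now refer to the same inode

    # Case 1: in-place write through the alias. "r+b" opens the existing
    # inode, so the change is visible under both names.
    with alias.open("r+b") as handle:
        handle.write(b"VERSION 2")
    after_in_place = original.read_bytes()

    # Case 2: write a new file and rename it over the alias. This replaces
    # the directory entry only, so the original keeps its content.
    tmp = workdir / "alias.txt.tmp"
    tmp.write_bytes(b"version 3")
    os.replace(tmp, alias)
    after_rename = original.read_bytes()

    return after_in_place, after_rename
```

Which of the two cases applies depends on how each application saves files, which is why hard-link dedupe is risky on data that is still being edited and less of a concern for backups that are never written to.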

In any case, a good design is to ask the kernel to do the dedupe step after user space has found duplicates. The kernel can double-check for you that they are really identical before doing the dedupe. This is available on Linux as the ioctl BTRFS_IOC_FILE_EXTENT_SAME (exposed generically as FIDEDUPERANGE since Linux 4.5).
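
A rough Python sketch of such a call, using FIDEDUPERANGE, the filesystem-independent name under which the same ioctl is exposed since Linux 4.5. Assumptions to check against `<linux/fs.h>` on the target system: the hard-coded request number (valid for the common ioctl encoding used on x86-64 and arm64) and the struct layouts; `dedupe_range` and `pack_request` are names invented for this example. It only works on filesystems that support the operation, such as Btrfs or XFS with reflink enabled.

```python
import fcntl
import os
import struct

# Assumed value of _IOWR(0x94, 54, struct file_dedupe_range) on x86-64/arm64.
FIDEDUPERANGE = 0xC0189436
FILE_DEDUPE_RANGE_SAME = 0
FILE_DEDUPE_RANGE_DIFFERS = 1

# struct file_dedupe_range: src_offset, src_length, dest_count, reserved1, reserved2
HEADER = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd, dest_offset, bytes_deduped, status, reserved
INFO = struct.Struct("=qQQiI")


def pack_request(src_offset: int, length: int, dest_fd: int, dest_offset: int) -> bytearray:
    """Build the ioctl argument for a single destination range."""
    return bytearray(
        HEADER.pack(src_offset, length, 1, 0, 0)
        + INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    )


def dedupe_range(src_path: str, dest_path: str, length: int,
                 src_offset: int = 0, dest_offset: int = 0) -> int:
    """Ask the kernel to share `length` bytes between two files.

    The kernel compares the ranges itself and only shares them if they are
    byte-for-byte identical. Returns the number of bytes deduplicated, which
    may be less than requested, so a real tool would loop.
    """
    with open(src_path, "rb") as src, open(dest_path, "r+b") as dest:
        request = pack_request(src_offset, length, dest.fileno(), dest_offset)
        fcntl.ioctl(src.fileno(), FIDEDUPERANGE, request)  # result is written back
        _fd, _off, deduped, status, _reserved = INFO.unpack_from(request, HEADER.size)
    if status == FILE_DEDUPE_RANGE_DIFFERS:
        raise ValueError("ranges differ, nothing was deduplicated")
    if status < 0:
        raise OSError(-status, os.strerror(-status))
    return deduped
```

Unlike a hard link, the two files stay separate inodes that merely share extents, so a later write to one of them is copy-on-write and does not affect the other.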

It was for me. I was using rsync with "--link-dest" earlier for this purpose, but that only works if the file is present in consecutive backups. I wanted to have the option of seeing a potentially different subset of files for each backup and saving disk space at the same time.
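
For reference, a sketch of the kind of rsync invocation meant here, built as an argument list in Python. The function name and the paths in the usage note are placeholders. As far as I understand it, rsync only hard-links a file when an unchanged copy sits at the same relative path in the `--link-dest` directory, so a file that was absent from the previous backup gets copied again in full.

```python
from pathlib import Path
from typing import List


def rsync_link_dest_command(source: Path, previous: Path, target: Path) -> List[str]:
    """Build an rsync command that hard-links unchanged files against `previous`."""
    return [
        "rsync",
        "-a",  # archive mode: recurse and preserve metadata
        # A relative --link-dest is interpreted relative to the destination
        # directory, so pass an absolute path to avoid surprises.
        f"--link-dest={previous.resolve()}",
        f"{source}/",  # trailing slash: copy the contents, not the directory itself
        str(target),
    ]
```

The resulting list can be handed to `subprocess.run(..., check=True)`, for example with `source=Path("/data")`, `previous=Path("/backup/2024-01-01")` and `target=Path("/backup/2024-01-02")`.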

Restic and Borg can do this at the block level, which is more effective but requires the tool to be installed when I want to check out something.