by beagle3 2291 days ago
Sometimes they do - e.g. if you replace a file in the ISO with one that is the same size up to block alignment, which is common when editing a text file or recompiling an executable with a minor change. They almost always do when it's a VM image representing a disk - only some blocks change on each write.

However, with self-synchronizing hashes of the kind used by rsync, bup, and borg, it doesn't matter - you could have a 1TB file, delete a single byte at position 100, and you would only need to store or transfer one new block (with average size 8KB for rsync, configurable for borg) if you already have a copy of the version before the change.

It's somewhat comparable to diff/patch, but not exactly. It's worse in that change granularity is only specified on average. It's better in that it works well on binary files, does not require a specific reference diff (it can reference all previous history), and efficiently supports reordering as well as small changes - if you divide a 4000-line text file into four 1000-line sections and reorder them 1,2,3,4 -> 3,1,4,2, you will find the diff/patch to be as long as a new copy, whereas a self-synchronizing hash decomposition will hardly take any space for the reordered file given the original.
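
To make the two claims above concrete (one deleted byte, and reordered sections), here is a minimal sketch of content-defined chunking in Python. It is a toy, not the actual algorithm of rsync, bup, or borg: the window size, multiplier, and cut rule (`WINDOW`, `BITS`, `BASE`) are hypothetical choices, and there is no minimum or maximum chunk size. The only property it relies on is that a cut point depends solely on the last `WINDOW` bytes, which is what lets the chunking resynchronize after an edit. rsync itself works somewhat differently (fixed-size blocks on one side, a rolling checksum slid over the other), so treat this as an illustration of the general idea only.

```python
import random

WINDOW = 16                   # cut decisions look only at the last 16 bytes
BITS = 8                      # cut when the top 8 hash bits are all ones (~256 B average)
BASE = 16777619               # arbitrary odd multiplier (hypothetical choice)
MOD = 1 << 32
OUT = pow(BASE, WINDOW, MOD)  # weight of the byte that slides out of the window


def chunks(data: bytes) -> list[bytes]:
    """Cut data wherever the rolling hash of the last WINDOW bytes hits the target."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * OUT) % MOD  # drop the byte leaving the window
        if i + 1 >= WINDOW and h >> (32 - BITS) == (1 << BITS) - 1:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])                    # trailing chunk without a cut
    return out


def new_chunks(old: bytes, new: bytes) -> list[bytes]:
    """Chunks of `new` that a store already holding `old` would still need."""
    known = set(chunks(old))
    return [c for c in chunks(new) if c not in known]


rng = random.Random(0)

# 1) Delete the byte at position 100 of a (small stand-in for a) large file.
big = bytes(rng.getrandbits(8) for _ in range(200_000))
edited = big[:100] + big[101:]
after_delete = new_chunks(big, edited)

# 2) Reorder four 1000-line sections 1,2,3,4 -> 3,1,4,2.
lines = [f"line {n}: {rng.random()}\n".encode() for n in range(4000)]
sec = [b"".join(lines[k:k + 1000]) for k in range(0, 4000, 1000)]
original = b"".join(sec)
reordered = sec[2] + sec[0] + sec[3] + sec[1]
after_reorder = new_chunks(original, reordered)

print(len(chunks(edited)), "chunks after delete,", len(after_delete), "new")
print(len(reordered), "bytes reordered,", sum(map(len, after_reorder)), "bytes new")
```

Because a cut depends only on the preceding `WINDOW` bytes, the deletion can disturb at most `WINDOW` chunks, and in practice usually one or two. Likewise, only the chunks near the section seams need to be stored for the reordered file.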


Oh, I've used rsync many times, but I thought it simply retransmitted changed files. (Oh, it needs the --checksum argument to do this, okay.)

So how do these self-synchronizing hashes work? Like a Merkle Tree? (Ah, okay https://en.wikipedia.org/wiki/Rsync#Determining_which_parts_... )

So rsync uses 8KB for chunk size, so a 1GB file has 125 000 chunks. (And if every chunk needs 16 bytes of hash data to send, that's about 2MB, pretty darn efficient, especially if it can spot reorders.) Though according to Wikipedia it only does this if the target file has the same size, so adding new files to ISOs might not work in the case of rsync, but still, the possibility is there for diff algos and version control systems.
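
The back-of-the-envelope numbers above can be checked in a few lines of Python. The 8KB chunk size and the 16 bytes of hash data per chunk are the figures assumed in this thread, not values read from rsync itself. The variable names are made up for the example.

```python
HASH_BYTES = 16  # per-chunk hash data, as assumed in the comment above

# Decimal units: 1 GB file, 8 KB chunks.
chunks_decimal = 1_000_000_000 // 8_000
overhead_decimal = chunks_decimal * HASH_BYTES
print(chunks_decimal, overhead_decimal)   # 125000 2000000  (about 2 MB)

# Binary units: 1 GiB file, 8 KiB chunks.
chunks_binary = 2**30 // 2**13
overhead_binary = chunks_binary * HASH_BYTES
print(chunks_binary, overhead_binary)     # 131072 2097152  (exactly 2 MiB)
```

Either way the hash data is about 0.2% of the file size.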

No, the target doesn't have to be the same size. As an optimization, if size and datetime are the same, rsync will assume no change and will not hash at all (though you can force it to).

But it will definitely use hashes when the size differs (unless forced to copy whole files, or copying between local file systems).
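
The decision rule described in this reply can be sketched as a small Python function. This is a simplified model of the behaviour as stated here, not rsync's source: `FileInfo`, `plan`, and the `always_check` and `whole_file` parameters are hypothetical names. `always_check` stands in for "you can force it to" (roughly what rsync's `--ignore-times` does), and `whole_file` for being forced to copy whole files or copying between local file systems.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    size: int    # bytes
    mtime: int   # modification time, whole seconds


def plan(src: FileInfo, dst: Optional[FileInfo], *,
         always_check: bool = False, whole_file: bool = False) -> str:
    """Return what a sender would do for one file under the rules described above."""
    if dst is None:
        return "send whole file"      # nothing on the target to diff against
    same_stamp = src.size == dst.size and src.mtime == dst.mtime
    if same_stamp and not always_check:
        return "skip"                 # assumed unchanged, no hashing at all
    if whole_file:
        return "send whole file"      # forced, or a local-to-local copy
    return "delta transfer"           # block hashes decide what gets sent


print(plan(FileInfo(700, 1000), FileInfo(700, 1000)))                     # skip
print(plan(FileInfo(701, 1000), FileInfo(700, 1000)))                     # delta transfer
print(plan(FileInfo(701, 1000), FileInfo(700, 1000), whole_file=True))    # send whole file
```

The point is that a size mismatch never disables the delta transfer. It only disables the size-and-time shortcut that would otherwise skip the file.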