Hacker News new | ask | show | jobs
by oofbey 260 days ago
The clever trick is how it recognizes insertions. The standard trick of computing hashes on fixed sized blocks works efficiently for substitutions but is totally defeated by an insertion or deletion.

Instead with CDC the block boundaries are define by the content, so an insertion doesn’t change the block boundary, so it can tell the subsequent blocks are unchanged. I haven’t read the CDC paper but I’m guessing they just use some probabilistic hash function to define certain strings as block boundaries.

2 comments

Probably worth noting that ordinary rsync can also handle insertions/deletions because it uses a rolling hash. Rsync's method is bandwidth-efficient, but not especially CPU-efficient.
> I haven’t read the CDC paper but I’m guessing they just use some probabilistic hash function to define certain strings as block boundaries.

You choose a number of bits (say, 12) and then evenly distribute these in a 48-bit mask; if the hash at any point has all these bits on, that defines a boundary.