Hacker News
by theamk 263 days ago
This CDC is "Content Defined Chunking" - fast incremental file transfer.

The use case is copying a file over a slow network when a previous version already exists at the destination, so you can save time by sending only the changed parts of the file.

Not to be confused with USB CDC ("communications device class"), a USB device protocol used to present serial ports and network cards. It can also be used to transfer files: the old PC-to-PC cables used it by implementing two network cards connected to each other.

2 comments

The clever trick is how it recognizes insertions. The standard trick of computing hashes on fixed-size blocks works efficiently for substitutions but is totally defeated by an insertion or deletion, because every block after the edit shifts and hashes differently.
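
To make that failure mode concrete, here is a minimal sketch. The 64-byte block size, the use of SHA-256, and the helper names are illustrative assumptions, not taken from any particular tool.

```python
import hashlib
import random

def fixed_block_hashes(data: bytes, block_size: int = 64) -> list[str]:
    """Hash each fixed-size block of data (the last block may be shorter)."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def shared_blocks(a: bytes, b: bytes) -> int:
    """Count distinct block hashes that appear in both inputs."""
    return len(set(fixed_block_hashes(a)) & set(fixed_block_hashes(b)))

# Deterministic pseudo-random test data: 1024 bytes = 16 blocks of 64 bytes.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(1024))

# Substitution: flip one byte in place; only the block containing it changes.
substituted = original[:100] + bytes([original[100] ^ 0xFF]) + original[101:]

# Insertion: add one byte; every block after the insertion point shifts.
inserted = original[:100] + b"X" + original[100:]

print("shared after substitution:", shared_blocks(original, substituted))
print("shared after insertion:   ", shared_blocks(original, inserted))
```

The substitution leaves all but one block reusable, while the single inserted byte leaves only the block before the insertion point matching.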

Instead, with CDC the block boundaries are defined by the content, so an insertion doesn’t shift the boundaries that follow it, and the subsequent blocks can be recognized as unchanged. I haven’t read the CDC paper but I’m guessing they just use some probabilistic hash function to define certain strings as block boundaries.

Probably worth noting that ordinary rsync can also handle insertions/deletions because it uses a rolling hash. Rsync's method is bandwidth-efficient, but not especially CPU-efficient.
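
A rough sketch of why a rolling hash handles insertions: the checksum of a window can be updated in constant time as the window slides one byte, so old blocks can be looked for at every byte offset of the new file. The checksum below follows the two-part weak checksum style described for rsync. The function names are made up for illustration, and the direct byte comparison is only a stand-in for the strong hash a real tool would use to confirm a match.

```python
import random

M = 1 << 16  # both checksum halves are kept modulo 2**16

def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute the (a, b) pair from scratch for one window."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide an n-byte window by one byte in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def find_matches(old: bytes, new: bytes, n: int) -> dict[int, int]:
    """Map offsets in `new` to offsets of identical n-byte blocks in `old`."""
    table = {weak_checksum(old[i:i + n]): i
             for i in range(0, len(old) - n + 1, n)}
    matches: dict[int, int] = {}
    if len(new) < n:
        return matches
    a, b = weak_checksum(new[:n])
    for off in range(len(new) - n + 1):
        hit = table.get((a, b))
        # Stand-in for the strong-hash check: compare the bytes directly.
        if hit is not None and old[hit:hit + n] == new[off:off + n]:
            matches[off] = hit
        if off + n < len(new):
            a, b = roll(a, b, new[off], new[off + n], n)
    return matches

rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(1024))
new = old[:100] + b"X" + old[100:]   # one byte inserted
matches = find_matches(old, new, 64)
print(len(matches), "of 16 old blocks found in the new file")
```

The loop visits every byte offset of the new file and does a table lookup at each one, which is one place the CPU cost comes from.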
> I haven’t read the CDC paper but I’m guessing they just use some probabilistic hash function to define certain strings as block boundaries.

You choose a number of bits (say, 12) and then evenly distribute these in a 48-bit mask; if the hash at any point has all these bits on, that defines a boundary.
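
Here is a minimal sketch of that boundary rule. The rolling hash used here is a gear-style hash (shift left by one, add a per-byte random value) truncated to 48 bits, which is an assumption on my part, as are the table seed and the exact bit positions in the mask. Real implementations typically also enforce minimum and maximum chunk sizes, which this leaves out.

```python
import random

HASH_BITS = 48
HASH_MASK = (1 << HASH_BITS) - 1

# 12 bits spread evenly across the 48-bit hash: bits 3, 7, 11, ..., 47.
BOUNDARY_MASK = sum(1 << (4 * i + 3) for i in range(12))

# Hypothetical gear table: one fixed pseudo-random 48-bit value per byte value.
_table_rng = random.Random(1234)
GEAR = [_table_rng.getrandbits(HASH_BITS) for _ in range(256)]

def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data wherever the rolling hash has all BOUNDARY_MASK bits set."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # After 48 shifts a byte's contribution falls off the top, so the
        # hash depends only on the most recent 48 bytes.
        h = ((h << 1) + GEAR[byte]) & HASH_MASK
        if (h & BOUNDARY_MASK) == BOUNDARY_MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks

# Deterministic pseudo-random input, then the same input with one byte inserted.
data_rng = random.Random(0)
old = bytes(data_rng.getrandbits(8) for _ in range(256 * 1024))
new = old[:1000] + b"!" + old[1000:]

old_chunks = cdc_chunks(old)
new_chunks = cdc_chunks(new)
changed = set(new_chunks) - set(old_chunks)
print(len(old_chunks), "chunks in old file;", len(changed), "chunk(s) to resend")
```

With 12 mask bits a boundary fires at roughly one position in 4096, so chunks average about 4 KiB. Because the decision at each position depends only on the preceding 48 bytes, the boundaries resynchronize shortly after the inserted byte and the later chunks come out identical.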

Not to be confused with the Centers for Disease Control.
Especially in the context of the recent (that is, last 10 years) removal of data from Centers for Disease Control sources due to changing political winds.