by btschaegg 2111 days ago

I think you misunderstood how the rolling hash is used in this context. It's not used to address a chunk; you'd use a plain old cryptographic hash function for that.

The rolling hash is used to find the chunk boundaries: hash a window before every byte (which is cheap with a rolling hash) and compare the result against a defined bit mask. For example: check whether the lowest 20 bits of the hash are zero. If so, you cut a chunk there, and you'd get chunks with about 2^20 bytes (1 MiB) average length.
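To make the boundary search concrete, here is a minimal sketch in Python. It uses a simple polynomial rolling hash as a stand-in; the window size, base, modulus, and the 12-bit mask (about 4 KiB chunks instead of 1 MiB, to keep the demo fast) are all assumptions for illustration and not what borgbackup actually uses. The names `split_chunks` and `chunk_id` are made up for this example.

```python
import hashlib
import random

# All constants below are illustrative assumptions, not borgbackup's values.
WINDOW = 48                      # bytes covered by the rolling hash
MASK_BITS = 12                   # 12 bits -> ~4 KiB average; 20 bits -> ~1 MiB
MASK = (1 << MASK_BITS) - 1
BASE = 257
MOD = (1 << 61) - 1              # prime modulus so the low bits are well mixed
OUT_WEIGHT = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def split_chunks(data: bytes) -> list:
    """Cut wherever the hash of the last WINDOW bytes has MASK_BITS zero low bits."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the oldest byte in O(1).
            h = (h - data[i - WINDOW] * OUT_WEIGHT) % MOD
        h = (h * BASE + byte) % MOD
        # Only test the mask once the window is full.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def chunk_id(chunk: bytes) -> str:
    """Address a chunk with a plain cryptographic hash, not the rolling hash."""
    return hashlib.sha256(chunk).hexdigest()


if __name__ == "__main__":
    rng = random.Random(0)
    data = bytes(rng.getrandbits(8) for _ in range(1 << 16))
    chunks = split_chunks(data)
    print(len(chunks), "chunks, average", len(data) // len(chunks), "bytes")
    print("first chunk id:", chunk_id(chunks[0]))
```

Note the two separate roles: the rolling hash only decides where to cut, while SHA-256 provides the address of each resulting chunk.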

As a good explanation, I'd encourage you to look at borgbackup's internals documentation: https://borgbackup.readthedocs.io/en/stable/internals.html

hinkley 2111 days ago

I think they understood just fine.

If I discover that the file I want to publish shares a range with an existing file, that does very little because the existing file has already chosen its chunk boundaries and I can’t influence those. That ship has sailed.

I can only benefit if the a priori chunks are small enough that some subset of the identified match is still addressable. And then I may only get half or two thirds of the improvement I was after.

dchest 2111 days ago

that does very little because the existing file has already chosen its chunk boundaries

If they both used the same rolling hash function on the same or similar data, then regardless of where each file starts and ends, and regardless of when the boundaries were chosen, they will share many chunks with high probability. That’s just how splitting with rolling hashes works: the cut points depend only on the bytes inside the hash window, so the chunks are variable-length and follow the content.
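A rough way to check this claim yourself, as a sketch under assumptions: the Python below chunks two byte strings independently with the same (illustrative) polynomial rolling hash and compares the resulting SHA-256 chunk IDs. The second "file" is the first one with an unrelated header prepended and one byte inserted in the middle. Random bytes stand in for real file content, and the parameters and the names `chunk_ids` and `shared_fraction` are invented for this example.

```python
import hashlib
import random

# Assumed parameters: 48-byte window, 12-bit mask (~4 KiB average chunks).
WINDOW, MASK = 48, (1 << 12) - 1
BASE, MOD = 257, (1 << 61) - 1
OUT_WEIGHT = pow(BASE, WINDOW - 1, MOD)


def chunk_ids(data: bytes) -> set:
    """Split with a rolling hash and return the SHA-256 IDs of all chunks."""
    ids, start, h = set(), 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * OUT_WEIGHT) % MOD  # drop oldest byte
        h = (h * BASE + byte) % MOD
        if i + 1 >= WINDOW and (h & MASK) == 0:
            ids.add(hashlib.sha256(data[start:i + 1]).digest())
            start = i + 1
    if start < len(data):
        ids.add(hashlib.sha256(data[start:]).digest())
    return ids


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of a's chunks that also occur among b's chunks."""
    ids_a, ids_b = chunk_ids(a), chunk_ids(b)
    return len(ids_a & ids_b) / len(ids_a) if ids_a else 0.0


if __name__ == "__main__":
    rng = random.Random(42)
    payload = bytes(rng.getrandbits(8) for _ in range(200_000))
    file_a = payload
    # Same content, shifted by a header and disturbed by one inserted byte.
    file_b = b"unrelated header" + payload[:100_000] + b"!" + payload[100_000:]
    frac = shared_fraction(file_a, file_b)
    print(f"{frac:.0%} of file_a's chunks also occur in file_b")
```

Neither file "knows" about the other when it is chunked. Only the chunks touching the header and the inserted byte should differ, because every other cut point is decided by window contents that are identical in both files.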

tleb_ 2111 days ago

The idea is that on non-random data, you are able to use a heuristic that creates variable-sized chunks that fit the data. The simplest way seems to be to detect padding zeros and start a new block on the first following non-zero byte. There are probably other ways; knowing the data type should help.
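As a sketch of that zero-padding heuristic in Python: the function below cuts right after each run of zero bytes, so the next block starts on the first non-zero byte. The minimum run length of 16 is an arbitrary assumption to avoid splitting on isolated zeros, and `split_on_zero_padding` is a made-up name.

```python
import re

MIN_RUN = 16  # assumed threshold: shorter zero runs are not treated as padding


def split_on_zero_padding(data: bytes, min_run: int = MIN_RUN) -> list:
    """Start a new block at the first non-zero byte after a run of zeros."""
    pattern = re.compile(rb"\x00{%d,}" % min_run)
    blocks, start = [], 0
    for match in pattern.finditer(data):
        end = match.end()          # index of the first byte after the zero run
        if end < len(data):        # only cut if a non-zero byte follows
            blocks.append(data[start:end])
            start = end
    if start < len(data):
        blocks.append(data[start:])
    return blocks


if __name__ == "__main__":
    sample = b"HEADER" + b"\x00" * 32 + b"BODY" + b"\x00" * 4 + b"TAIL"
    for block in split_on_zero_padding(sample):
        print(len(block), block[:8])
```

The padding stays attached to the end of the preceding block here; that is one possible choice, not the only one.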

hinkley 2110 days ago

That seems fairly unlikely. Not a lot of big files have zero padding, and if they did, then compress them. It will reduce your transfers more than any range substitutions ever will.
