by ljusten
1252 days ago
In a nutshell, the algorithm computes:

  uint64_t hash = 0;
  uint64_t magic_pattern = 0b001000010000100001000...;
  for (size_t n = 0; n < data.size(); ++n) {
    hash = (hash << 1) + random_table[data[n]];
    if ((hash & magic_pattern) == 0) {
      SetChunkBoundaryAt(n);
    }
  }

In practice, there are more bells and whistles, but that's the gist of it. By tweaking the number of 1s in magic_pattern you can influence the average chunk size (the distance between two boundaries). Every additional 1 halves the average chunk size, since one more hash bit has to be zero for a boundary to be set. There's no special handling of compressed file types. You'd probably want to do that at a much higher level, e.g. by just checking file extensions.
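
For anyone who wants to play with this, here is a minimal self-contained sketch of the loop above. It is not the real implementation. MakeRandomTable, MakeMagicPattern and FindChunkBoundaries are names I made up for illustration, the seed is arbitrary, and since the full magic_pattern is elided above, placing a 1 at every fifth bit starting from bit 61 is only my guess at its shape.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: 256 pseudo-random 64-bit values, one per byte value.
// The seed is arbitrary; any fixed table shared by all parties would do.
std::array<std::uint64_t, 256> MakeRandomTable(std::uint64_t seed = 42) {
  std::mt19937_64 rng(seed);
  std::array<std::uint64_t, 256> table{};
  for (auto& value : table) value = rng();
  return table;
}

// Hypothetical helper: a mask with `one_bits` ones, at bits 61, 56, 51, ...
// (every fifth bit). This layout is an assumption, not the original pattern.
std::uint64_t MakeMagicPattern(int one_bits) {
  assert(one_bits >= 0 && one_bits <= 13);
  std::uint64_t pattern = 0;
  for (int i = 0; i < one_bits; ++i) {
    pattern |= std::uint64_t{1} << (61 - 5 * i);
  }
  return pattern;
}

// Same loop as in the comment, but it collects the boundary positions
// instead of calling SetChunkBoundaryAt(n).
std::vector<std::size_t> FindChunkBoundaries(
    const std::vector<std::uint8_t>& data, std::uint64_t magic_pattern,
    const std::array<std::uint64_t, 256>& random_table) {
  std::vector<std::size_t> boundaries;
  std::uint64_t hash = 0;
  for (std::size_t n = 0; n < data.size(); ++n) {
    // Unsigned arithmetic wraps, so bytes older than 64 positions
    // are shifted out of the hash entirely.
    hash = (hash << 1) + random_table[data[n]];
    if ((hash & magic_pattern) == 0) {
      boundaries.push_back(n);
    }
  }
  return boundaries;
}
```

Because of the shift, the hash at position n only depends on the last 64 bytes. That is what makes the boundaries content-defined: if you insert a few bytes at the front of a file, the boundaries further in land on the same content again instead of all moving. Assuming the hash bits behave roughly like random bits, a pattern with k ones should give an average chunk size of about 2^k bytes.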