|
|
|
|
|
by shpx
964 days ago
|
|
Checking the file size first is good, but reading every byte of every file that has at least one file of matching size and and doing a ton of XOR steps (or whatever BLAKE3 is) for all those bytes can't be optimal when different files probably differ within the first few bytes. If you only have two files, read them X bytes at a time in parallel and XOR the bytes directly, stopping at the first different byte. For more than 2 files, to a first approximation, you have your list of file pointers, you read the first X-number of bytes for all of them and then sort your list in-place based on those first X bytes (sorting is a linear operation if all the X bytes are the same), then you iterate over that sorted list processing each run of identical X bytes in a depth-first fashion. If you have just one element that starts with a given set of X bytes, it's unique and you don't have to process it anymore, otherwise repeat the process reading the next X bytes but only for the files that started with the same bytes. X is probably 8, so that you can efficiently XOR 64 bits together for your comparisons. |
|
Worth some performance measurements, anyway!