Hacker News new | ask | show | jobs
by shpx 964 days ago
Checking the file size first is good, but reading every byte of every file that has at least one file of matching size and and doing a ton of XOR steps (or whatever BLAKE3 is) for all those bytes can't be optimal when different files probably differ within the first few bytes.

If you only have two files, read them X bytes at a time in parallel and XOR the bytes directly, stopping at the first different byte. For more than 2 files, to a first approximation, you have your list of file pointers, you read the first X-number of bytes for all of them and then sort your list in-place based on those first X bytes (sorting is a linear operation if all the X bytes are the same), then you iterate over that sorted list processing each run of identical X bytes in a depth-first fashion. If you have just one element that starts with a given set of X bytes, it's unique and you don't have to process it anymore, otherwise repeat the process reading the next X bytes but only for the files that started with the same bytes. X is probably 8, so that you can efficiently XOR 64 bits together for your comparisons.

1 comments

I've considered adding a "just check the first {user-configged-bytes}" mode, which would offer enough of the speed up you describe, which I think jdupes does (maybe? It's been awhile since I looked at it). I think the speed of opening the file and reading the first buffered page of bytes would dominate the time of the operation, especially if one were to do async reads (which my program does NOT do. I should look into it.)

Worth some performance measurements, anyway!