Hacker News new | ask | show | jobs
by karparov 479 days ago
This can be done much faster and safer.

You can group all files into buckets, and as soon as a bucket is empty, discard it. If in the end there are still files in the same bucket, they are duplicates.

Initially all files are in the same bucket.

You now iterate over differentiators which given two files tell you whether they are maybe equal or definitely not equal. They become more and more costly but also more and more exact. You run the differentiator on all files in a bucket to split the bucket into finer equivalence classes.

For example:

* Differentiator 1 is the file size. It's really cheap, you only look at metadata, not the file contents.

* Differentiator 2 can be a hash over the first file block. Slower since you need to open every file, but still blazingly fast and O(1) in file size.

* Differentiator 3 can be a hash over the whole file. O(N) in file size but so precise that if you use a cryptographic hash then you're very unlikely to have false positives still.

* Differentiator 4 can compare files bit for bit. Whether that is really needed depends on how much you trust collision resistance of your chosen hash function. Don't discard this though. Git got bitten by this.

1 comments

Not surprisingly, differentiator 2 can just be the first byte (or machine word). Differentiator 3 can be the last byte (or word). At that point, 99.99% (in practice more 9s) of files are different and you’re read at most 2 blocks per file. I haven’t figured out a good differentiator 3 prior to hashing, but it’s already so rare, that it’s not worth it, in my experience.