| >which is probably really unique Wonder what the distribution is here, on average? I know certain file types tend to cluster in specific ranges. >maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash Definitely, for comparing any two files. But, if you're searching for duplicates across the entire disk, then you're theoretically checking each file multiple times, and each file is checked against multiple times. So, hashing them on first pass could conceivably be more efficient. >if you just compare the bytes there is no chance of hash collision You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives. But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all. You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal. |
To have a mere one in a billion chance of getting a SHA-256 collision, you'd need to spend 160 million times more energy than the total annual energy production on our planet (and that's assuming our best bitcoin mining efficiency, actual file hashing needs way more energy).
The probability of a collision is so astronomically small, that if your computer ever observed a SHA-256 collision, it would certainly be due to a CPU or RAM failure (bit flips are within range of probabilities that actually happen).