| > that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, doesn't mean that they won't be found someday. This is what I meant by "barring any discovered hash collisions" but in retrospect I didn't make that clear enough. Though, if you're crafting your own malicious different-content-same-size files and storing them in your NAS to cause a hash collision to make them appear the same, then I bet several governments are willing to pay top dollar for your abilities :D Or, different scenario, say you're hosting a Dropbox-like service and you're storing files for hundreds of thousands of users, then you shouldn't be using a duplicate-file-finding util anyway, it'd be better if it was implemented at a different layer anyway. The scenario you describe (lots of big files of same sizes, few dupes) I agree hashing the entire file would be wasteful. From my experience on my file server, when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content. Put another way, if multiple files of the same sufficiently-large size are encountered, expect to read the entirety of those files anyway, whether hashing or checking byte-for-byte, because they are likely dupes. So, there's still potential for perf gains by avoiding hashing, but I'm willing to bet it isn't as much as one would hope/expect. (You do have me curious as to how much difference it could make, though) Edit: I'm also willing to admit that I have so many dupes because my backup strategy is TRASH and I have dupes everywhere, and so my scenario could be more unusual than other people. |
Yes, if the number of files is small enough, then "notice unique file sizes" is really the only optimization that ends up mattering much. If you have a few thousand files and they're each multiple gigabytes then hopefully you'll get lucky and no two will have the same size.
But the ideal tool should also try to handle the opposite case well too.
First, imagine if you have a huge collection of unique ~100KiB files. Now the "birthday paradox" means that size collisions are inevitable so optimizing the total I/O needed to prove two files are different starts to belp.
But the pathological case is even worse -- what if nearly all of your files are about the same size? For instance, suppose you have a program that is recording time-series data from a sensor and rotating the file every time it grows to 10MB. This sort of thing happens all the time dealing with scientific data sets -- you might have a directory with thousands of large files, all exactly the same size. If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.