| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bodyfour 961 days ago

"The math" only matters for random collisions, which are effectively impossible (less likely than the CPU malfunctioning). However that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, doesn't mean that they won't be found someday.

But as I tried to describe (probably in way too much detail) the real problem with "hash everything, compare hashes afterwards" is that it implies that you'll be doing I/O to read all of the file's contents even when it isn't needed to prove uniqueness. For a lot of common use cases (big files, few dupes) this can mean doing 1000x more I/O than you need.

Once you design the solution around avoiding unneeded I/O, you find that hashing also stops being useful.

1 comments

redsaz 961 days ago

> that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, doesn't mean that they won't be found someday.

This is what I meant by "barring any discovered hash collisions" but in retrospect I didn't make that clear enough.

Though, if you're crafting your own malicious different-content-same-size files and storing them in your NAS to cause a hash collision to make them appear the same, then I bet several governments are willing to pay top dollar for your abilities :D

Or, different scenario, say you're hosting a Dropbox-like service and you're storing files for hundreds of thousands of users, then you shouldn't be using a duplicate-file-finding util anyway, it'd be better if it was implemented at a different layer anyway.

The scenario you describe (lots of big files of same sizes, few dupes) I agree hashing the entire file would be wasteful. From my experience on my file server, when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Put another way, if multiple files of the same sufficiently-large size are encountered, expect to read the entirety of those files anyway, whether hashing or checking byte-for-byte, because they are likely dupes. So, there's still potential for perf gains by avoiding hashing, but I'm willing to bet it isn't as much as one would hope/expect.

(You do have me curious as to how much difference it could make, though)

Edit: I'm also willing to admit that I have so many dupes because my backup strategy is TRASH and I have dupes everywhere, and so my scenario could be more unusual than other people.

link

bodyfour 961 days ago

> when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Yes, if the number of files is small enough, then "notice unique file sizes" is really the only optimization that ends up mattering much. If you have a few thousand files and they're each multiple gigabytes then hopefully you'll get lucky and no two will have the same size.

But the ideal tool should also try to handle the opposite case well too.

First, imagine if you have a huge collection of unique ~100KiB files. Now the "birthday paradox" means that size collisions are inevitable so optimizing the total I/O needed to prove two files are different starts to belp.

But the pathological case is even worse -- what if nearly all of your files are about the same size? For instance, suppose you have a program that is recording time-series data from a sensor and rotating the file every time it grows to 10MB. This sort of thing happens all the time dealing with scientific data sets -- you might have a directory with thousands of large files, all exactly the same size. If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.

link

redsaz 961 days ago

> But the ideal tool should also try to handle the opposite case well too... a huge collection of unique ~100KiB files.

and

> But the pathological case is even worse -- what if nearly all of your files are about the same size?... If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.

I agree this is a worthy consideration. No sense in reading the entirety of each of those files when only reading the first block would do, in order to remove the uniques early. If I were to redo the util, it'd probably be something like:

1. Group all files into lists of same file sizes.

2. After all files are read, eliminate any groupings with just one file, these are unique files

3. Read first N bytes to pare down files in those lists, (so now the key is filesize and hash-of-first-N-bytes or even filesize and first-N-bytes if N is small enough, either way)

4. After each filesize-group is subgrouped by first-N-bytes evaluation, eliminate any subgroupings with just one file, these are unique.

5. What remains are files fairly likely to be duplicates.

5a. For users that consider this "good enough", allow them to stop here. (Some deduper tools do this)

5b. For everybody else, in order to make sure the files are dupes or not, the files can next be subgrouped by fully byte-for-byte comparison and/or hashed, whatever the user is good with.

6. The remaining groupings of two or more files are dupes.

In the end I opted not to go for this rewrite, at least not yet, because I got sidetracked thinking about how the whole reason I'm doing this in the first place is because the way I've backed up data across all my machines for years is pretty horrible, all things considered, and now I've got my wife's data to consider too, and I still want my data to be locally available on my laptop, and I don't want to entirely rely on cloud services for syncing, and, and, and... so now I'm making a tool for all that. And then when it is finished, somebody can come along and say "you could've just used owncloud and syncthing and rclone and a pair of external drives, good grief man."

Still though, I might rework the deduper logic anyway.

link