Hacker News new | ask | show | jobs
by bodyfour 961 days ago
> when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Yes, if the number of files is small enough, then "notice unique file sizes" is really the only optimization that ends up mattering much. If you have a few thousand files and they're each multiple gigabytes then hopefully you'll get lucky and no two will have the same size.

But the ideal tool should also try to handle the opposite case well too.

First, imagine if you have a huge collection of unique ~100KiB files. Now the "birthday paradox" means that size collisions are inevitable so optimizing the total I/O needed to prove two files are different starts to belp.

But the pathological case is even worse -- what if nearly all of your files are about the same size? For instance, suppose you have a program that is recording time-series data from a sensor and rotating the file every time it grows to 10MB. This sort of thing happens all the time dealing with scientific data sets -- you might have a directory with thousands of large files, all exactly the same size. If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.

1 comments

> But the ideal tool should also try to handle the opposite case well too... a huge collection of unique ~100KiB files.

and

> But the pathological case is even worse -- what if nearly all of your files are about the same size?... If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.

I agree this is a worthy consideration. No sense in reading the entirety of each of those files when only reading the first block would do, in order to remove the uniques early. If I were to redo the util, it'd probably be something like:

1. Group all files into lists of same file sizes.

2. After all files are read, eliminate any groupings with just one file, these are unique files

3. Read first N bytes to pare down files in those lists, (so now the key is filesize and hash-of-first-N-bytes or even filesize and first-N-bytes if N is small enough, either way)

4. After each filesize-group is subgrouped by first-N-bytes evaluation, eliminate any subgroupings with just one file, these are unique.

5. What remains are files fairly likely to be duplicates.

5a. For users that consider this "good enough", allow them to stop here. (Some deduper tools do this)

5b. For everybody else, in order to make sure the files are dupes or not, the files can next be subgrouped by fully byte-for-byte comparison and/or hashed, whatever the user is good with.

6. The remaining groupings of two or more files are dupes.

In the end I opted not to go for this rewrite, at least not yet, because I got sidetracked thinking about how the whole reason I'm doing this in the first place is because the way I've backed up data across all my machines for years is pretty horrible, all things considered, and now I've got my wife's data to consider too, and I still want my data to be locally available on my laptop, and I don't want to entirely rely on cloud services for syncing, and, and, and... so now I'm making a tool for all that. And then when it is finished, somebody can come along and say "you could've just used owncloud and syncthing and rclone and a pair of external drives, good grief man."

Still though, I might rework the deduper logic anyway.