

by at_a_remove 954 days ago

So, my deduplication is about merging various archives I have of various things I've pulled off of 4chan since 2005. My algorithm is a bit backward because it starts with the proposition that all files are the same. This will turn out to save on comparisons.

Differentiators start with file size (obviously), then an MD5 of the first one percent of the file, then a SHA-1 of the first one percent of the file, then an MD5 of the first ten percent, and so on. The byte-by-byte comparison is the last-ditch effort. A differentiator is only triggered by having two or more files together in a subgroup, and the results are stored in a database.

So we start with a massive group of all of the files. Then subgroups are made by file size. Some subgroups might only have one member and so we stop there. If not, we start with the MD5 of the first one percent ...
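A minimal sketch of the subgroup-splitting idea described above, in Python (the thread does not say what language the actual tool uses). The function names `partial_md5`, `split`, and `candidate_duplicates` are made up for illustration. It leaves out the SHA-1 passes and the database, and it ends on a full-file MD5 where the described tool ends on a byte-by-byte comparison.

```python
import hashlib
import os
from collections import defaultdict


def partial_md5(path, fraction):
    """MD5 of the first `fraction` of the file (at least one byte if non-empty)."""
    size = os.path.getsize(path)
    remaining = max(1, int(size * fraction)) if size else 0
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        while remaining > 0:
            chunk = fh.read(min(remaining, 1 << 20))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def split(groups, key):
    """Split every group by `key`; subgroups left with one member are dropped."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def candidate_duplicates(paths):
    groups = [list(paths)]  # start by assuming every file is the same
    groups = split(groups, os.path.getsize)  # cheapest differentiator first
    for fraction in (0.01, 0.10, 1.0):  # assumed schedule: 1%, 10%, whole file
        groups = split(groups, lambda p, f=fraction: partial_md5(p, f))
    return groups
```

Because a differentiator only runs on subgroups that still have two or more members, a file with a unique size is never opened at all.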

I will probably work on image matching, eventually.

The other reason I made it was that, once the dupes were detected, I wanted custom rules for what to do with them.

I know Microsoft had a metadata dream about files and while I don't disagree with it, most people just ... don't do it or they do it inconsistently. I've worked with librarians, people who would agree on where to put a given book in a vast series of shelves, but when it comes to digital works, they get all sloppy. I think one of the better possible frontiers for AI is tagging documents and images. But it's still quite a ways away. Just as an example, one would think that Microsoft would have a spellchecker for filenames by now.

1 comments

redsaz 954 days ago

I like the strategy of hashing only the first part of a file as one phase of a multiphase dedupe, so that unique files are eliminated quickly. I wish I had done it that way with my util. Maybe for v2!


at_a_remove 954 days ago

I think my next pass at it will be to merge the MD5 and SHA-1 steps so that we get two outputs at the end. That way I would save on the file-reading.
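One way the merged step could look, as a sketch using Python's standard `hashlib`: each chunk read from disk is fed to both digests, so the file is read only once. The function name `md5_and_sha1` and the `limit` parameter are hypothetical.

```python
import hashlib


def md5_and_sha1(path, limit=None, chunk_size=1 << 20):
    """Return (md5_hex, sha1_hex) of the first `limit` bytes, reading the file once.

    With limit=None the whole file is hashed.
    """
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    remaining = limit
    with open(path, "rb") as fh:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = fh.read(want)
            if not chunk:
                break
            md5.update(chunk)   # same bytes go to both digests
            sha1.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```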

But after that, I think one percent of the end of the file. Then ten percent of the start of the file, then ten percent of the end of the file ...

Given that metadata is typically at the end or beginning of a file, those seem like the best place to look for differences.
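A rough sketch of hashing either end of a file, assuming Python and a made-up helper name `edge_digest`. It reads the whole slice into memory in one go, which is fine at one percent but would want chunked reads for large slices of large files.

```python
import hashlib
import os


def edge_digest(path, fraction, from_end=False):
    """SHA-1 of the first (or, with from_end=True, the last) `fraction` of the file."""
    size = os.path.getsize(path)
    length = min(size, max(1, int(size * fraction))) if size else 0
    with open(path, "rb") as fh:
        if from_end:
            fh.seek(size - length)  # jump straight to the tail
        data = fh.read(length)
    return hashlib.sha1(data).hexdigest()
```

Two files that differ only in trailing metadata would produce the same head digest and different tail digests, so the tail check would split them without reading the middle of either file.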

I would be open to other hashes so long as they were drop-in easy. I'm not concerned about malicious, forced collisions because they would have to overcome two different kinds of hashing, and the most it would earn is a delay, since there's always a byte-by-byte comparison at the very end.
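For reference, the byte-by-byte step that backstops the hashes could be as small as the following Python sketch. `same_bytes` is a hypothetical name, and it assumes regular files on disk.

```python
def same_bytes(path_a, path_b, chunk_size=1 << 20):
    """Compare two files chunk by chunk, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # differing content, or one file ended early
            if not a:
                return True   # both files hit EOF together
```

Since this only runs on files that have already survived every hash differentiator, a forced collision costs one extra read of the colliding files and nothing more.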

I suspect I would similarly want to use multiple fingerprinting methods for the visual characteristics of an image file.


redsaz 954 days ago

> I'm not concerned about malicious, forced collisions

Consider taking a look at xxh3, possibly. It seems like a pretty decent contender on hashing speed, if you're willing to trade away the guarantees of a secure hash: https://github.com/Cyan4973/xxHash
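On the "drop-in easy" point, a sketch of how this might slot in from Python, assuming the third-party `xxhash` package (a binding to the library linked above) is installed. Its hash objects expose the same `update()` and `hexdigest()` methods as `hashlib` objects. The helper names and the MD5 fallback are illustrative choices, not anything from the thread.

```python
import hashlib

try:
    import xxhash  # third-party package: pip install xxhash
except ImportError:
    xxhash = None


def fast_hasher():
    """Return a fresh hash object; XXH3 (64-bit) if available, else MD5."""
    if xxhash is not None:
        return xxhash.xxh3_64()
    return hashlib.md5()  # fallback so the sketch still runs without xxhash


def digest_bytes(data):
    """Hex digest of `data` using whichever hasher is available."""
    h = fast_hasher()
    h.update(data)
    return h.hexdigest()
```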
