| So, my deduplication is about merging various archives I have of various things I've pulled off of 4chan since 2005. My algorithm is a bit backward because it starts with the proposition that all files are the same. This will turn out to save on comparisons. Differentiators start with file size difference (obv), then an MD5 of the first one percent of the file, then a SHA-1 of the first one percent of a file, then a MD5 of the first ten percent, and so on. The byte-by-byte comparison is the last ditch effort. A differentiator is only triggered by a having two or more files together in a subgroup, and the results are stored in a database. So we start with a massive group of all of the files. Then subgroups are made by file size. Some subgroups might only have one member and so we stop there. If not, we start with the MD5 of the first one percent ... I will probably work on image matching, eventually. The other reason I made it was so that after the dupes were detected, I wanted custom rules as to what to do with them. I know Microsoft had a metadata dream about files and while I don't disagree with it, most people just ... don't do it or they do it inconsistently. I've worked with librarians, people who would agree on where to put a given book in a vast series of shelves, but when it comes to digital works, they get all sloppy. I think one of the better possible frontiers for AI is tagging out documents and images. But it's still quite aways-away. Just as an example, one would think that Microsoft would have a spellchecker for filenames by now. |