| (Warning: rambling ahead, since in the past I've spent a decent amount of time on the same problem) > using blake3 for hashing the content Using any hash algorithm isn't a good design, at least for SSD storage. First, it invites some degree of cryptographic risk. i.e. if a collision is ever discovered in the hash then your program can be tricked into discovering a false duplicate. Whether that is a problem depends on the use case, but isn't ideal. Worse, though, is that it just doesn't make any sense algorithmically. Consider the simple example of having two 1TB files and you want to discover if they're identical. You could do a cryptographic hash of each of them and (barring any malicious collisions) tell whether they're the same. However, now imagine that those two files differed in the very first byte. Now it seems that you could have figured out they're different a lot faster, right? So what you really want to do is read both files chunk-by-chunk (probably some number of disk blocks at a time) so you can detect the files-are-different case early. (After all, the common case of files that differ is that they'll differ early!). You could still compare the chunks using a cryptographic hash, but now there is no benefit: you can just compare the two blocks of memory directly faster than you can take a crypto hash of them. C's memcmp() works fine but since you are probably working on fixed-sized aligned blocks you can do slightly better with a hand-rolled SIMD loop. The one advantage that a cryptographic hash gives you is that it provides a memory-efficient way of reading all of file A and then all of file B. Therefore if disk seeks are expensive, it can be a benefit (again, if you can accept the risk of a malicious false-positive). However if the files are SSD backed and you have enough RAM to read a decent sized chunk of each file into memory simultaneously this ceases to be a problem. To extend this from 2 files to many, first stat() all of the files and group them by file size. After all, two files of different sizes aren't going to ever be equal. You can think of the size as a "hash" of the contents that you get for free. Any files that are 0 bytes are (of course) duplicates of each other so you can just return those as "hits". Any file that has a unique size (which thankfully is often the common case) is not a duplicate of anything and you don't even have to open it. If you care about hardlinks, you also want to track the inode numbers at this step so you can avoid comparing two files that are actually the same inode. Then for any group of files with the same size, read each block in turn. The tricky part is that you want to subdivide the group by the contents of the file. i.e. if you have 4 same-sized files and two have contents "AAAA..." and the other two have contents "BBBB..." then you didn't find any unique files yet, but you need to split the set of 4 files into two new subsets of 2 files each. Data-structure wise, keep a worklist of items containing of (1) a set of (at least two) files that could be duplicates and (2) how many bytes you've already verified are the same among them. Then when you encounter this "split" scenario you can just push a new worklist item and continue working on one of the groups. The bit you need to be careful of is not introducing bad worst-case performance here in a hard case (e.g. you have a million potentially-duplicate files, and reading one block separates them into 5000 groups each with 200 members). Just a decent hash-table is enough with the key being the whole disk block. Because maintaining this hash table adds a bit of complication, it can seem worthwhile to build a special-case for groups of exactly two files, where you can just do the simple read-and-compare. But then you can re-combine this by observing that the case where you have N files that are all the same is also worth optimizing for. So instead, just do the read-and-compare on all of the files until you find the first two that are different -- only then start building the hashtable when you have two different blocks and more files still to read. That way the common-ish case where you have many files all the same can be handled as fast as possible. There are things that operating systems could provide that would make this even better: 1. It would be nice if it were easy to estimate the likely seek cost before picking an algorithm. If the file system would simply indicate whether it thought it was backed by spinning rust that would be great. 2. Also if you could ask for the filesystem to read a non-fixed number of bytes (without resorting to async-I/O) that would be helpful. ("Give me up to 1MB of the file, but stop early if you have to seek to a new extent"). Having the ability to basically read the file extent-by-extent instead of block-by-block would mean we could be seek-efficient while reading multiple files in parallel. 3. Finally, it would be great if there was some portable way to access any block-hash metadata the filesystem itself keeps. A filesystem that does its own deduplication work might already know that two blocks must have different contents without reading them because it already scanned them. On the flip-side, if the filesystem supports copy-on-write file snapshots then it could tell us in advance that two inodes are really the same underlying file before we open them. |
Differentiators start with file size difference (obv), then an MD5 of the first one percent of the file, then a SHA-1 of the first one percent of a file, then a MD5 of the first ten percent, and so on. The byte-by-byte comparison is the last ditch effort. A differentiator is only triggered by a having two or more files together in a subgroup, and the results are stored in a database.
So we start with a massive group of all of the files. Then subgroups are made by file size. Some subgroups might only have one member and so we stop there. If not, we start with the MD5 of the first one percent ...
I will probably work on image matching, eventually.
The other reason I made it was so that after the dupes were detected, I wanted custom rules as to what to do with them.
I know Microsoft had a metadata dream about files and while I don't disagree with it, most people just ... don't do it or they do it inconsistently. I've worked with librarians, people who would agree on where to put a given book in a vast series of shelves, but when it comes to digital works, they get all sloppy. I think one of the better possible frontiers for AI is tagging out documents and images. But it's still quite aways-away. Just as an example, one would think that Microsoft would have a spellchecker for filenames by now.