| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by redsaz 964 days ago

Writing a program like this is one of the first exercises I give myself when learning a new programming language, because it touches a little bit of everything (reading files, output, CLI, using libraries, hashmaps, functions, loops, conditionals, etc) and isn't too onerous to implement.

My latest (it's a few years old at this point) is lsdup (rust version) using blake3 for hashing the content: https://github.com/redsaz/lsdup/

All it does is list the groups of duplicate files, grouped by hash, groups ordered by size. I'll usually pipe the output to a file, then do whatever I want to the list, and run a different script to process the resulting list. It works fine enough.

2 comments

shpx 963 days ago

Checking the file size first is good, but reading every byte of every file that has at least one file of matching size and and doing a ton of XOR steps (or whatever BLAKE3 is) for all those bytes can't be optimal when different files probably differ within the first few bytes.

If you only have two files, read them X bytes at a time in parallel and XOR the bytes directly, stopping at the first different byte. For more than 2 files, to a first approximation, you have your list of file pointers, you read the first X-number of bytes for all of them and then sort your list in-place based on those first X bytes (sorting is a linear operation if all the X bytes are the same), then you iterate over that sorted list processing each run of identical X bytes in a depth-first fashion. If you have just one element that starts with a given set of X bytes, it's unique and you don't have to process it anymore, otherwise repeat the process reading the next X bytes but only for the files that started with the same bytes. X is probably 8, so that you can efficiently XOR 64 bits together for your comparisons.

link

redsaz 963 days ago

I've considered adding a "just check the first {user-configged-bytes}" mode, which would offer enough of the speed up you describe, which I think jdupes does (maybe? It's been awhile since I looked at it). I think the speed of opening the file and reading the first buffered page of bytes would dominate the time of the operation, especially if one were to do async reads (which my program does NOT do. I should look into it.)

Worth some performance measurements, anyway!

link

bodyfour 964 days ago

(Warning: rambling ahead, since in the past I've spent a decent amount of time on the same problem)

> using blake3 for hashing the content

Using any hash algorithm isn't a good design, at least for SSD storage.

First, it invites some degree of cryptographic risk. i.e. if a collision is ever discovered in the hash then your program can be tricked into discovering a false duplicate. Whether that is a problem depends on the use case, but isn't ideal.

Worse, though, is that it just doesn't make any sense algorithmically. Consider the simple example of having two 1TB files and you want to discover if they're identical. You could do a cryptographic hash of each of them and (barring any malicious collisions) tell whether they're the same.

However, now imagine that those two files differed in the very first byte. Now it seems that you could have figured out they're different a lot faster, right?

So what you really want to do is read both files chunk-by-chunk (probably some number of disk blocks at a time) so you can detect the files-are-different case early. (After all, the common case of files that differ is that they'll differ early!). You could still compare the chunks using a cryptographic hash, but now there is no benefit: you can just compare the two blocks of memory directly faster than you can take a crypto hash of them. C's memcmp() works fine but since you are probably working on fixed-sized aligned blocks you can do slightly better with a hand-rolled SIMD loop.

The one advantage that a cryptographic hash gives you is that it provides a memory-efficient way of reading all of file A and then all of file B. Therefore if disk seeks are expensive, it can be a benefit (again, if you can accept the risk of a malicious false-positive). However if the files are SSD backed and you have enough RAM to read a decent sized chunk of each file into memory simultaneously this ceases to be a problem.

To extend this from 2 files to many, first stat() all of the files and group them by file size. After all, two files of different sizes aren't going to ever be equal. You can think of the size as a "hash" of the contents that you get for free. Any files that are 0 bytes are (of course) duplicates of each other so you can just return those as "hits". Any file that has a unique size (which thankfully is often the common case) is not a duplicate of anything and you don't even have to open it. If you care about hardlinks, you also want to track the inode numbers at this step so you can avoid comparing two files that are actually the same inode.

Then for any group of files with the same size, read each block in turn. The tricky part is that you want to subdivide the group by the contents of the file. i.e. if you have 4 same-sized files and two have contents "AAAA..." and the other two have contents "BBBB..." then you didn't find any unique files yet, but you need to split the set of 4 files into two new subsets of 2 files each. Data-structure wise, keep a worklist of items containing of (1) a set of (at least two) files that could be duplicates and (2) how many bytes you've already verified are the same among them. Then when you encounter this "split" scenario you can just push a new worklist item and continue working on one of the groups.

The bit you need to be careful of is not introducing bad worst-case performance here in a hard case (e.g. you have a million potentially-duplicate files, and reading one block separates them into 5000 groups each with 200 members). Just a decent hash-table is enough with the key being the whole disk block.

Because maintaining this hash table adds a bit of complication, it can seem worthwhile to build a special-case for groups of exactly two files, where you can just do the simple read-and-compare. But then you can re-combine this by observing that the case where you have N files that are all the same is also worth optimizing for. So instead, just do the read-and-compare on all of the files until you find the first two that are different -- only then start building the hashtable when you have two different blocks and more files still to read. That way the common-ish case where you have many files all the same can be handled as fast as possible.

There are things that operating systems could provide that would make this even better:

1. It would be nice if it were easy to estimate the likely seek cost before picking an algorithm. If the file system would simply indicate whether it thought it was backed by spinning rust that would be great.

2. Also if you could ask for the filesystem to read a non-fixed number of bytes (without resorting to async-I/O) that would be helpful. ("Give me up to 1MB of the file, but stop early if you have to seek to a new extent"). Having the ability to basically read the file extent-by-extent instead of block-by-block would mean we could be seek-efficient while reading multiple files in parallel.

3. Finally, it would be great if there was some portable way to access any block-hash metadata the filesystem itself keeps. A filesystem that does its own deduplication work might already know that two blocks must have different contents without reading them because it already scanned them. On the flip-side, if the filesystem supports copy-on-write file snapshots then it could tell us in advance that two inodes are really the same underlying file before we open them.

link

at_a_remove 963 days ago

So, my deduplication is about merging various archives I have of various things I've pulled off of 4chan since 2005. My algorithm is a bit backward because it starts with the proposition that all files are the same. This will turn out to save on comparisons.

Differentiators start with file size difference (obv), then an MD5 of the first one percent of the file, then a SHA-1 of the first one percent of a file, then a MD5 of the first ten percent, and so on. The byte-by-byte comparison is the last ditch effort. A differentiator is only triggered by a having two or more files together in a subgroup, and the results are stored in a database.

So we start with a massive group of all of the files. Then subgroups are made by file size. Some subgroups might only have one member and so we stop there. If not, we start with the MD5 of the first one percent ...

I will probably work on image matching, eventually.

The other reason I made it was so that after the dupes were detected, I wanted custom rules as to what to do with them.

I know Microsoft had a metadata dream about files and while I don't disagree with it, most people just ... don't do it or they do it inconsistently. I've worked with librarians, people who would agree on where to put a given book in a vast series of shelves, but when it comes to digital works, they get all sloppy. I think one of the better possible frontiers for AI is tagging out documents and images. But it's still quite aways-away. Just as an example, one would think that Microsoft would have a spellchecker for filenames by now.

link

redsaz 963 days ago

I like the strategy of only hashing the first part of a file as a multiphase approach of deduping in order to quickly eliminate unique files, I wish I had done it that way with my util. Maybe for v2!

link

at_a_remove 963 days ago

I think my next pass at it will be to merge the MD5 and SHA-1 steps so that we get two outputs at the end, this way I would save on the file-reading.

But after that, I think one percent of the end of the file. Then ten percent of the start of the file, then ten percent of the end of the file ...

Given that metadata is typically at the end or beginning of a file, those seem like the best place to look for differences.

I would be open to other hashes so long as they were drop-in easy. I'm not concerned about malicious, forced collisions because they would have to overcome two different kinds of hashing, and the most it would earn is a delay, since there's always a byte-by-byte comparison at the very end.

I suspect I would similarly want to use multiple fingerprinting methods for the visual characteristics of an image file.

link

redsaz 963 days ago

> I'm not concerned about malicious, forced collisions

Consider taking a look at xxh3, possibly. It seems a pretty decent contender, hashing speed-wise, to trade off of secure hashing: https://github.com/Cyan4973/xxHash

link

redsaz 963 days ago

This isn't the first time I've heard concerns about using hashes for checking file equality. I've considered adding a "paranoid mode" to do the direct byte-for-byte checks for such folks that don't even want to introduce a so-remote-it's-virtually-impossible theoretical chance for a collision to occur.

I'd go into the math about how remote of a chance it would be (barring any discovered hash collisions) but others have explained it better than me elsewhere.

link

bodyfour 963 days ago

"The math" only matters for random collisions, which are effectively impossible (less likely than the CPU malfunctioning). However that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, doesn't mean that they won't be found someday.

But as I tried to describe (probably in way too much detail) the real problem with "hash everything, compare hashes afterwards" is that it implies that you'll be doing I/O to read all of the file's contents even when it isn't needed to prove uniqueness. For a lot of common use cases (big files, few dupes) this can mean doing 1000x more I/O than you need.

Once you design the solution around avoiding unneeded I/O, you find that hashing also stops being useful.

link

redsaz 963 days ago

> that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, doesn't mean that they won't be found someday.

This is what I meant by "barring any discovered hash collisions" but in retrospect I didn't make that clear enough.

Though, if you're crafting your own malicious different-content-same-size files and storing them in your NAS to cause a hash collision to make them appear the same, then I bet several governments are willing to pay top dollar for your abilities :D

Or, different scenario, say you're hosting a Dropbox-like service and you're storing files for hundreds of thousands of users, then you shouldn't be using a duplicate-file-finding util anyway, it'd be better if it was implemented at a different layer anyway.

The scenario you describe (lots of big files of same sizes, few dupes) I agree hashing the entire file would be wasteful. From my experience on my file server, when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Put another way, if multiple files of the same sufficiently-large size are encountered, expect to read the entirety of those files anyway, whether hashing or checking byte-for-byte, because they are likely dupes. So, there's still potential for perf gains by avoiding hashing, but I'm willing to bet it isn't as much as one would hope/expect.

(You do have me curious as to how much difference it could make, though)

Edit: I'm also willing to admit that I have so many dupes because my backup strategy is TRASH and I have dupes everywhere, and so my scenario could be more unusual than other people.

link

bodyfour 963 days ago

> when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Yes, if the number of files is small enough, then "notice unique file sizes" is really the only optimization that ends up mattering much. If you have a few thousand files and they're each multiple gigabytes then hopefully you'll get lucky and no two will have the same size.

But the ideal tool should also try to handle the opposite case well too.

First, imagine if you have a huge collection of unique ~100KiB files. Now the "birthday paradox" means that size collisions are inevitable so optimizing the total I/O needed to prove two files are different starts to belp.

But the pathological case is even worse -- what if nearly all of your files are about the same size? For instance, suppose you have a program that is recording time-series data from a sensor and rotating the file every time it grows to 10MB. This sort of thing happens all the time dealing with scientific data sets -- you might have a directory with thousands of large files, all exactly the same size. If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.

link

redsaz 963 days ago

> But the ideal tool should also try to handle the opposite case well too... a huge collection of unique ~100KiB files.

and

> But the pathological case is even worse -- what if nearly all of your files are about the same size?... If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.

I agree this is a worthy consideration. No sense in reading the entirety of each of those files when only reading the first block would do, in order to remove the uniques early. If I were to redo the util, it'd probably be something like:

1. Group all files into lists of same file sizes.

2. After all files are read, eliminate any groupings with just one file, these are unique files

3. Read first N bytes to pare down files in those lists, (so now the key is filesize and hash-of-first-N-bytes or even filesize and first-N-bytes if N is small enough, either way)

4. After each filesize-group is subgrouped by first-N-bytes evaluation, eliminate any subgroupings with just one file, these are unique.

5. What remains are files fairly likely to be duplicates.

5a. For users that consider this "good enough", allow them to stop here. (Some deduper tools do this)

5b. For everybody else, in order to make sure the files are dupes or not, the files can next be subgrouped by fully byte-for-byte comparison and/or hashed, whatever the user is good with.

6. The remaining groupings of two or more files are dupes.

In the end I opted not to go for this rewrite, at least not yet, because I got sidetracked thinking about how the whole reason I'm doing this in the first place is because the way I've backed up data across all my machines for years is pretty horrible, all things considered, and now I've got my wife's data to consider too, and I still want my data to be locally available on my laptop, and I don't want to entirely rely on cloud services for syncing, and, and, and... so now I'm making a tool for all that. And then when it is finished, somebody can come along and say "you could've just used owncloud and syncthing and rclone and a pair of external drives, good grief man."

Still though, I might rework the deduper logic anyway.

link