by spangry 3497 days ago
I'm not trying to justify anything. I'm just trying to suggest you're labouring under a misapprehension. And this has nothing to do with security. I'm guessing you've heard the (good) advice that md5 is not a secure hashing function for, say, storing passwords, and then promptly joined the 'md5 is bad for all the things' cargo cult.

So while you're correct about the two images on that blog, the only reason you'd get a clash is because the author of that blog post spent ~15 hours on an AWS GPU instance generating the extra data which, when appended to those files, results in a clash.

So, I guess if you are in the habit of grabbing random files from your hdd, loading them on to an AWS GPU instance for 15 hours (per file) and generating hash collisions, then yeah, don't use fdupes.

1 comment

fdupes is not a problem assuming wikipedia's description [1] is correct: "It first compares file sizes, partial MD5 signatures, full MD5 signatures, and then performs a byte-by-byte comparison for verification."
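To make the staged approach in that description concrete, here is a minimal Python sketch of the same idea: size, then a partial MD5, then a full MD5, then a byte-by-byte check. It is not fdupes' actual code. The names `find_duplicates`, `split_by`, `md5_of` and the 4096-byte `PARTIAL_BYTES` prefix are assumptions made up for illustration.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 4096  # assumed prefix length for the cheap first hash


def md5_of(path, limit=None):
    """MD5 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            size = 65536 if remaining is None else min(65536, remaining)
            chunk = handle.read(size)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def split_by(groups, key):
    """Split each candidate group by key(path); keep only groups with 2+ files."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    groups = [list(paths)]
    groups = split_by(groups, os.path.getsize)                     # 1. file size
    groups = split_by(groups, lambda p: md5_of(p, PARTIAL_BYTES))  # 2. partial MD5
    groups = split_by(groups, md5_of)                              # 3. full MD5
    confirmed = []
    for group in groups:                                           # 4. byte-by-byte
        remaining = group
        while len(remaining) > 1:
            head, rest = remaining[0], remaining[1:]
            same = [head] + [p for p in rest if filecmp.cmp(head, p, shallow=False)]
            if len(same) > 1:
                confirmed.append(same)
            remaining = [p for p in rest if p not in same]
    return confirmed
```

The final `filecmp.cmp(..., shallow=False)` pass is what makes an MD5 collision harmless here: two colliding files would survive stage 3 but be separated in stage 4.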

I was unimpressed by the md5 used in the shell script at the original link, which is using a truncated md5...
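For a rough sense of why truncation matters even without an attacker, here is a back-of-the-envelope birthday-bound sketch. It assumes the hash behaves like a uniform random function over distinct, non-adversarial files. The bit lengths and file count are hypothetical, since the script's actual truncation length isn't quoted here, and `collision_probability` is just an illustrative name.

```python
import math


def collision_probability(num_files, hash_bits):
    """Approximate chance that at least two of num_files distinct files share a hash.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    pairs = num_files * (num_files - 1) / 2
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    # Hypothetical truncation lengths versus the full 128-bit MD5 output
    for bits in (32, 64, 128):
        print(bits, collision_probability(1_000_000, bits))
```

Every bit dropped roughly doubles the accidental collision odds for a given number of files, which is why a heavily truncated digest with no byte-by-byte check is a different proposition from full MD5.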

[1] https://en.wikipedia.org/wiki/Fdupes

Ok, fair enough. I would agree with the view that using md5, presumably for the faster performance, is probably not the best trade-off to be making here. Unless we're dealing with an NVMe drive (or something more exotic), you're likely to be IO bound even if using more computationally intensive hashing functions.
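If you want to check the IO-bound claim on your own machine, a rough sketch like the one below measures in-memory hashing throughput, which you can then compare against your drive's sequential read speed. The function name `hash_throughput` and the 16 MiB buffer are arbitrary choices, the results will vary by CPU and Python build, and only standard-library algorithms are covered.

```python
import hashlib
import time


def hash_throughput(name, payload, rounds=5):
    """Return approximate MB/s for hashlib algorithm `name` over in-memory data."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, payload).digest()
    # Guard against a zero reading on very coarse timers
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(payload) * rounds / elapsed / 1e6


if __name__ == "__main__":
    payload = bytes(16 * 1024 * 1024)  # 16 MiB of zeros, held in memory
    for name in ("md5", "sha1", "sha256"):
        print(f"{name}: {hash_throughput(name, payload):.0f} MB/s")
```

If the numbers printed are well above what the disk can deliver, the choice of hash function is unlikely to be the bottleneck.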

And if you are deduping on really fast storage, you'd get way better performance (with comparable safety) using something like xxHash64 (https://cyan4973.github.io/xxHash/).