Hacker News new | ask | show | jobs
by AlienRobot 902 days ago
In Python the Adler library returns a 32 bit checksum. It works pretty well when you're comparing one file to another file. It doesn't work pretty well if you want to, for example, create a quick fingerprint that (tries to) uniquely identify tens of thousands of files.

On StackOverflow I saw someone say that they got hash collisions in MD5 (128 bit) after hashing around 20k files.

When I tried making something similar I figured if I added the size of the file in bytes to the hash that would decrease the number of hash collisions since you would need a permutation of bytes in a set of bytes of same size to generate the same MD5 hash to get a collision. Still feels random and unavoidable in the greater scheme of things, though.

2 comments

When making a dupe detector some decades ago, I kept the filesize outside of the hash.

No need to compute the hash until there's at least two files with the same size.

Conventional analysis of non-contrived files would suggest only a 50% chance of an MD5 collision among 2^64 (18 quintillion) files.

So any SO account of "hash collisions in MD5 (128 bit) after hashing around 20k files" may be indicative of some other bug, misreporting, or having a set of files that includes pairs intentionally-contrived to include MD5 collisions. (That's now trivial as long as you're not targeting a specific MD5 value, just intending two files to match, and have some ranges in the files where you can stuff the right result-aligning binary patterns.)

I'm not well-versed in this, but I think you're wrong. A MD5 hash has 2^128 permutations, so it's like the birthday problem, you'll get 50% chance of any two files in a set having the same hash with far fewer than 2^64 files.

Besides, we're talking about probabilities. It's possible for 2 files, one after the other, to have the exact same hash just because of chance. In fact, I think there was a bitcoin bug due to this, something about making the wrong assumption that the hash of the entire blockchain and the hash of the entire blockchain + 1 block would never be the same hash, if I remember correctly.

Hashes are simply not a reliable method of uniquely identifying things. They are a convenient method but if you implement any hash-based system for identification you should also implement a escape hatch for when two different things have the same hash but must be treated differently.

Nope. 2^64 is the number of tries needed for a 50% chance of a collision between two of the tries. You don't have to stay poorly read on this, and groundlessly allege wrongness:

https://auth0.com/blog/amp/birthday-attacks-collisions-and-p...