Hacker News new | ask | show | jobs
by gojomo 900 days ago
Conventional analysis of non-contrived files would suggest only a 50% chance of an MD5 collision among 2^64 (18 quintillion) files.

So any SO account of "hash collisions in MD5 (128 bit) after hashing around 20k files" may be indicative of some other bug, misreporting, or having a set of files that includes pairs intentionally-contrived to include MD5 collisions. (That's now trivial as long as you're not targeting a specific MD5 value, just intending two files to match, and have some ranges in the files where you can stuff the right result-aligning binary patterns.)

1 comments

I'm not well-versed in this, but I think you're wrong. A MD5 hash has 2^128 permutations, so it's like the birthday problem, you'll get 50% chance of any two files in a set having the same hash with far fewer than 2^64 files.

Besides, we're talking about probabilities. It's possible for 2 files, one after the other, to have the exact same hash just because of chance. In fact, I think there was a bitcoin bug due to this, something about making the wrong assumption that the hash of the entire blockchain and the hash of the entire blockchain + 1 block would never be the same hash, if I remember correctly.

Hashes are simply not a reliable method of uniquely identifying things. They are a convenient method but if you implement any hash-based system for identification you should also implement a escape hatch for when two different things have the same hash but must be treated differently.

Nope. 2^64 is the number of tries needed for a 50% chance of a collision between two of the tries. You don't have to stay poorly read on this, and groundlessly allege wrongness:

https://auth0.com/blog/amp/birthday-attacks-collisions-and-p...