| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by AlienRobot 902 days ago

In Python the Adler library returns a 32 bit checksum. It works pretty well when you're comparing one file to another file. It doesn't work pretty well if you want to, for example, create a quick fingerprint that (tries to) uniquely identify tens of thousands of files.

On StackOverflow I saw someone say that they got hash collisions in MD5 (128 bit) after hashing around 20k files.

When I tried making something similar I figured if I added the size of the file in bytes to the hash that would decrease the number of hash collisions since you would need a permutation of bytes in a set of bytes of same size to generate the same MD5 hash to get a collision. Still feels random and unavoidable in the greater scheme of things, though.

2 comments

magicalhippo 902 days ago

When making a dupe detector some decades ago, I kept the filesize outside of the hash.

No need to compute the hash until there's at least two files with the same size.

link

gojomo 900 days ago

Conventional analysis of non-contrived files would suggest only a 50% chance of an MD5 collision among 2^64 (18 quintillion) files.

So any SO account of "hash collisions in MD5 (128 bit) after hashing around 20k files" may be indicative of some other bug, misreporting, or having a set of files that includes pairs intentionally-contrived to include MD5 collisions. (That's now trivial as long as you're not targeting a specific MD5 value, just intending two files to match, and have some ranges in the files where you can stuff the right result-aligning binary patterns.)

link

AlienRobot 898 days ago

I'm not well-versed in this, but I think you're wrong. A MD5 hash has 2^128 permutations, so it's like the birthday problem, you'll get 50% chance of any two files in a set having the same hash with far fewer than 2^64 files.

Besides, we're talking about probabilities. It's possible for 2 files, one after the other, to have the exact same hash just because of chance. In fact, I think there was a bitcoin bug due to this, something about making the wrong assumption that the hash of the entire blockchain and the hash of the entire blockchain + 1 block would never be the same hash, if I remember correctly.

Hashes are simply not a reliable method of uniquely identifying things. They are a convenient method but if you implement any hash-based system for identification you should also implement a escape hatch for when two different things have the same hash but must be treated differently.

link

gojomo 894 days ago

Nope. 2^64 is the number of tries needed for a 50% chance of a collision between two of the tries. You don't have to stay poorly read on this, and groundlessly allege wrongness:

https://auth0.com/blog/amp/birthday-attacks-collisions-and-p...

link