Hacker News new | ask | show | jobs
by ARandumGuy 589 days ago
Naively, 1/(2^{hash_size_in_bits}). Which is about 1 in 4 billion odds for a 32 bit hash, and gets astronomically low at higher bit counts.

Of course, that's assuming a perfect, evenly distributed hash algorithm. And that's just the odds that any given pair of images has the same hash, not the odds that a hash conflict exists somewhere on the internet.

1 comments

You need to know the input space as well as the output space (hash size).

If you have a 32bit hash but your input is only 16bit, you'll never have a collision (and you'll be wasting a ton of space on your hashes!).

Image files can get into the megabytes though, so unless the output hash is large the potential for collisions is probably not all that low.

You do not need to know the input space.

Normal hash functions have pseudo-random outputs and they can collide even when the input space is much smaller than the output space.

In fact, I'll go run ten million values, encoded into 24 bits each, through a 40 bit hash and count the collisions. My hash of choice will be a truncated sha256.

... I got 49 collisions.