| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by hansvm 493 days ago

> valid, 128 bits

The birthday paradox is a thing. If you have 128 bits of entropy, you expect the 50% mark to be proportional to 64-bit keys, not 128 bits. 64 bits is a lot, but in my current $WORK project if I only had 128 bits of entropy the chance of failure any given year would be 0.16%. That's not a lot, but it's not a negligible amount either.

Bigger companies care more. Google has a paper floating around about how "64 bits isn't as big as it used to be" or something to that effect, complaining about how they're running out of 64-bit keys and can't blindly use 128-bit random keys to prevent duplication.

> bits of entropy

Consumer-grade hash functions are often the wrong place to look for best-case collision chances. Take, e.g., the default Python hash function which hashes each integer to itself (mod 2^64). The collision chance for truly random data is sufficiently low, but every big dictionary I've seen in a real-world Python project has had a few collisions. Other languages usually make similar tradeoffs (almost nobody uses crytographic hashes by default since they're too slow). I wouldn't, by default, trust a generic 1-million-bit hash to not have collisions in a program of any size and complexity. 128 bits, even with low enough execution counts to otherwise make sense, is also unlikely to pan out in the real world.

2 comments

pornel 493 days ago

I agree that 128 bits is on the lower end of "never", but you still need to store trillions of hashes to have a one-in-a-trillion chance to see a collision (and that's already the overall probability, you don't multiply it by the number of inserts to get 1:1 chance :) I don't think anybody in the world has ever seen a collision of a cryptographically strong 128-bit hash that wasn't a bug or attack.

Birthday paradox applies when you store the items together (it's a chance of collision against any existing item in the set), so overall annual hashing churn isn't affected (more hashes against a smaller set doesn't increase your collision probability as quickly).

link

hansvm 492 days ago

Based on currently available public estimates, Google stores around 2^75 bytes, most of that backed by a small number of very general-purpose object stores. A lot of that is from larger files, but you're still approaching birthday-paradox numbers for in-the-wild 128-bit hash collisions.

link

GoblinSlayer 493 days ago

Hashtables have collisions because they don't use all bits of hash, they calculate index=hash%capacity. It doesn't matter, how you calculate the hash, if you have only a few places to insert an item, they will collide.

link

hansvm 492 days ago

Right, but the problem they were describing was storing the "hash" in a hash table, not storing the item using a hash. For that, it absolutely matters, and the fact that it was a 128-bit hash IMO isn't good enough because the hash function itself likely sucks.

link