Hacker News new | ask | show | jobs
by jhhh 495 days ago
I can see the incident was a jumping off point to talk about bad APIs (bcrypt probably should error >72) but it sounds like the actual bug was they weren't checking the value in the cache matched the data they used in the hash for the key. The authentication cache check should survive any arbitrarily bad hashing algorithm because all of them are going to have collisions (pigeonhole principal). Even an arbitrarily 'strong' hash function with no input truncation, as long as it has a fixed width result, will have this property. Thus, any arguing in the comments here about different hash functions with different truncation properties is moot.

The analogy is something like creating a hash map whose insert function computes the slot for the key and unconditionally puts the value there instead of checking if the keys are the same during a collision. No amount of tinkering with the hash function fixes this problem. The algorithm is wrong. A hashmap should survive and be correct even giving it a hash function that always returns 4.

4 comments

Something similar happens in my company too. In a particular place, we use the hash of a string as the key in a hashmap instead of the string itself, because the hash is smaller and is easier to compare after the initial map has been made. It is a 64bit hash too. I have been crying about this everytime it comes up, and the response is, it will never happen. My problem is that we will never know if it ever happens too.
Isn't a hash map supposed to already be internally hashing the string for you and correctly handling collisions?
Yikes.

It's valid to assume "it will never happen" for 128 bits or more (if the hash function isn't broken) since chance of a random collision is astronomically small, but a collision in 64 bits is within realm of possibility (50% chance of hitting a dupe among 2^32 items).

> valid, 128 bits

The birthday paradox is a thing. If you have 128 bits of entropy, you expect the 50% mark to be proportional to 64-bit keys, not 128 bits. 64 bits is a lot, but in my current $WORK project if I only had 128 bits of entropy the chance of failure any given year would be 0.16%. That's not a lot, but it's not a negligible amount either.

Bigger companies care more. Google has a paper floating around about how "64 bits isn't as big as it used to be" or something to that effect, complaining about how they're running out of 64-bit keys and can't blindly use 128-bit random keys to prevent duplication.

> bits of entropy

Consumer-grade hash functions are often the wrong place to look for best-case collision chances. Take, e.g., the default Python hash function which hashes each integer to itself (mod 2^64). The collision chance for truly random data is sufficiently low, but every big dictionary I've seen in a real-world Python project has had a few collisions. Other languages usually make similar tradeoffs (almost nobody uses crytographic hashes by default since they're too slow). I wouldn't, by default, trust a generic 1-million-bit hash to not have collisions in a program of any size and complexity. 128 bits, even with low enough execution counts to otherwise make sense, is also unlikely to pan out in the real world.

I agree that 128 bits is on the lower end of "never", but you still need to store trillions of hashes to have a one-in-a-trillion chance to see a collision (and that's already the overall probability, you don't multiply it by the number of inserts to get 1:1 chance :) I don't think anybody in the world has ever seen a collision of a cryptographically strong 128-bit hash that wasn't a bug or attack.

Birthday paradox applies when you store the items together (it's a chance of collision against any existing item in the set), so overall annual hashing churn isn't affected (more hashes against a smaller set doesn't increase your collision probability as quickly).

Based on currently available public estimates, Google stores around 2^75 bytes, most of that backed by a small number of very general-purpose object stores. A lot of that is from larger files, but you're still approaching birthday-paradox numbers for in-the-wild 128-bit hash collisions.
Hashtables have collisions because they don't use all bits of hash, they calculate index=hash%capacity. It doesn't matter, how you calculate the hash, if you have only a few places to insert an item, they will collide.
Right, but the problem they were describing was storing the "hash" in a hash table, not storing the item using a hash. For that, it absolutely matters, and the fact that it was a 128-bit hash IMO isn't good enough because the hash function itself likely sucks.
This makes no sense - it’s a hash map because it hashes things for you…
A hash map still stores the entire key, which may be undesirable if the key datum can be large.
Not sure about that. A hash function suitable for security sensitive work, used properly, should make a collision so unlikely that you can basically forget it that it's even possible.

Think about it, that's what hashing passwords relies on. We don't store a plaintext password for a final check if the password hash matches, we count on a collision being basically impossible.

A hashmap is different, because it's using a much weaker hash function with far fewer security guarantees.

Plus, you're assuming the original values are even kept around for comparison. The cache key likely just mapped to something simple like a boolean or status flag.

I would guess that they felt comfortable that the bcrypt output (192 bits) is enough that collisions are very unlikely. If these were already partitioned by customer, rather than being a single cache for the entire Okta userbase that seems fine. You're going to have weird cosmic ray bugs more often than a natural collision.

Now, the data structure they're using for a cache will use some sort of hash table, likely in memory, so maybe they've got the 192-bit bcrypt "key" and then that's hashed again, perhaps well or perhaps badly [e.g. C++ really likes using the identity function so hash(12345) = 12345] but then a modulo function is applied to find the key in an index and then we go groping about to look for the Key + Value pair. That part the API probably took care of, so even if the hash has size 6-bits, the full 192-bit key was checked. But the original data (userid:username:password) is not compared, only that 192-bit cache key.

I've experienced that Most incidents, whether it's security, performance, or availability, etc. rarely have one single thing that goes wrong. Theres's a chain of events that happen.

Poor API design can make it easier for other contributing factors (checking cache here, but could also be not running load tests, not fuzzing, human error, etc.) to cause incidents.

I'm glad to see this come out, plus which libraries handle out of bounds conditions with errors vs. fix-up the input to cause silent failures.