Hacker News new | ask | show | jobs
by tptacek 499 days ago
Bcrypt is a password hash, not a KDF, which is the way it was used in this API. It's super unclear to me why they wanted a string-based KDF here at all; does anyone have more context?

I've in the past been annoying about saying I think we should just call all password hashes "KDFs", but here's a really good illustration of why I was definitely wrong about that. A KDF is a generally-useful bit of cryptography joinery; a password hash has exactly one job.

8 comments

They didn't want a KDF, as far as I know, but they wanted a hash function with unlimited input size.

Including the username in the hash input gives you guaranteed domain separation between users that you don't get from salts/nonces. Its a generally good idea if you have a hash function with unlimited input size (all modern cryptographic hash functions except bcrypt have unlimited input size).

> but they wanted a hash function with unlimited input size

I'm kind of baffled how they came to use bcrypt for this. Bcrypt is not exactly subtle about only supporting 72 bytes of input. And this is at a company who provides auth as a service; I've got to imagine they had multiple engineers who knew this (I guess not working on that code). Hell, I know this and I've only used bcrypt twice and I'm nowhere near a security/crypto guy.

BCrypt should loudly fail if more than 72 bytes are sent to its input.
Maybe it should. Discarding the rest of the bytes works fine for passwords, though. I guess that's just not sufficient.
In my book, discarding entropy is a generally dumb thing to do. Passwords are usually under 72 chars, but a lot of people use concatenations of usernames and passwords in their hash to get guaranteed domain separation between users.
They clearly wanted something stronger than "a hash function" or they'd have reached for weaker cryptographic hashes.
They wanted a hard-to-compute cryptographic hash function. Today, that means bcrypt or something with a KDF construction. However, they needed one with unlimited input size, which rules out bcrypt.
Or just a hash of the bcrypt hash, for the password!

I don't like using thought-stopping cliches any more than anybody else does, but this design feels a little cargo-culted. All this stuff follows the more fundamental question of "why is the password mixed into a cache key"?

Yeah, I think both of the following would have worked if they wanted the password involved in a cache key and they wanted bcrypt to be used:

* bcrypt(SHA-512(PW || stuff))

* SHA(stuff || bcrypt(PW))

Disclaimer: Not cryptography advice.

It's still unclear to me why the password is in there.

> It's still unclear to me why the password is in there.

Perhaps they did not want to apply cache invalidation purely by the passage of time, or want that passage of time to be long, but wanted to treat a credentials update as a cache invalidating event. A safer way to implement that would perhaps be to have a concept of a version of an account, incremented when authentication options or other significant properties change, and including that in the cache key.

I'm not sure why it would matter though: even if a credentials change does invalidate the cache from the PoV of the user looking up information, the information is potentially still in the cache so could still be referred to by someone else who has gained knowledge of the old credentials.

Perhaps the password is used as part of the cache Key so that a password update implicitly invalidates the cache?
hmac-bcrypt solves that problem very well, and should replace plain bcrypt: https://github.com/epixoip/hmac-bcrypt
For 'unlimited' input size it should be SHA-3-512. Maybe too slow, but Bcrypt is slower, right? Less things to go wrong too.
bcrypt-pbkdf (used in OpenSSH) exists for that purpose.
Would scrypt solve this problem? Or is it same as bcrypt? Or is scrypt depending on the hardware?
So let me take on the burden of stupid here: how are a password hash and a string-based KDF different? (I mean, the oldest well-known example of the former literally calls itself a PBKDF.) I understand this particular function from strings to large fixed numbers was limited in the length of the string it would accept, and I agree that’s a problem, but it feels like a problem orthogonal to the distinction you’re drawing.
There is a huge amount of overlap, in that most modern password hashes can be used and are sort of fundamentally based on the idea of a string KDF. The big differences are, as you can see here, that a string or password KDF will take an arbitrarily long string, and that a KDF produces some specific raw random output, and a password hash produces a verifier string (the raw random hash, plus usually some metadata encoded some way; for bcrypt, that's the algorithm and cost and salt, for instance).
A potential distinction is entropy preservation. For password hashes you usually want to preserve as much entropy as possible although one could argue that beyond 256 bits of output it may not matter (only one-time pads would suffer from smaller output). KDFs on the other hand must output a correctly-sized key for a particular cipher and so have further constraints on output choices (and potentially avoiding weak keys, e.g. for elliptic curve point generation).
Password hash functions are designed to be slow, are designed to be use with salts, and may have low entropy inputs. Being slow is a waste for (true) KDFs, salts aren't relevant (although nonces may be), and are designed for high entropy inputs.

The naming overlap between the two is bad, so the industry has tried to move towards naming the two differently. Password hashing functions are not ideal KDFs, even though a particular primitive may be secure for use as a KDF. That's a root of some of the confusion.

String KDFs are also slow. That's the basic strategy for making high-entropy keys out of low-entropy inputs.
Password hash functions are intentionally designed to be extremely slow (at an exponential scale). While this makes perfect sense for password hashing, it is nonsensical to have such an intentional, configurable slowdown mechanism in KDFs. KDFs have computational cost, but having the kind of extreme slowdown that password hash functions have makes no sense for purpose designed KDFs, especially when needing to derive many keys.
The value is the combination of userid, username, and password, so in threads on other platforms people have hypothesised that the developer tried to play it safe and use a password hash because of the password's presence.

Also I'm not sure the average developer understands the distinction.

  > Also I'm not sure the average developer understands the distinction.
I'm an average developer. I'm not sure that I understand exactly. What should I be reading, or what can you tell me?

Thank you!

> I've in the past been annoying about saying I think we should just call all password hashes "KDFs"

I remember :-): https://news.ycombinator.com/item?id=42899432

Hypothetically here is one way it might have played out

Product - we need to provide service availability even if the AD is down

Engineer - Ok, may be we can store the ~credentials in cache

Security - oh, in that case make sure everything in cache is hashed properly with the recommended Bcrypt algorithm

Engineer - We got the approval from the security, we are in much safer zone, lets deliver and get a win

Unfortunately the industry defined Bcrypt as a KDF for some time, even though it is better named as a "password hashing function". Cryptography has a history of being bad at picking good names for new work.

In addition to (true) KDFs, people often want a HKDF (HMAC-based Key Derivation Function) or hierarchical deterministic key derivation function (HDK).

I believe there have been earlier protocols where the user’s secrets were used as a KDF to generate credentials in such a way that the server never sees the user’s password.

I’m wondering if okta was inspired by those.

> Bcrypt is a password hash, not a KDF

I feel gross calling a function that just blatantly ignores part of its input a hash, much less a password hash. It's like calling a rock a fish, because they're both in water, despite the lack of swimming. In any case, a hash that ignores some of its input is certainly not a cryptographically secure hash, so why is it being used for crypto?