Hacker News new | ask | show | jobs
by lvh 2949 days ago
Yes. They don't quite define how the hash works in the post, but assuming it's something like SHA256(email), that's easy to enumerate.

There are ways to do this better. Let's say that it's 1 party and you're trying to figure out if you've seen en email address before. (That's the case in the article, there are also schemes where you and another entity can figure out if you both saw any email addresses -- but that's not what we're discussing here.)

We already know how to take relatively low entropy things and store them securely to see if you've seen them before, for password storage! However, password storage works a little differently. You _know_ which entry you're checking against because you have a secret (password) but also an identifier (user name) -- so you can recompute against the same random key. This randomization means attackers need to try every password for every user. This doesn't work for us, because we just have an email, but it's close.

Three parts worth considering: KDF, PRF and truncation. Firstly, your (deterministic, for reasons mentioned above) PRF turns your low-entropy input into a higher-entropy key. But (again, for reasons mentioned above) attackers still just have to try every email should they compromise your database. You can fix that problem by also adding a PRF (pseudorandom function) that you rate-limit vigorously. Think of a PRF as a keyed hash -- the usual example is HMAC-SHA256. If you're capable of keeping PRF key material safe but might leak a database dump (not unreasonable), the PRF forces the attack to be online: an attacker can only validate guesses as long as they have access to the PRF, and the PRF comes with audit trails and rate limits.

Finally, you can choose to truncate the output. Because the output space of your PRF will be much, much larger than the input space of email addresses, a match out of the PRF gives you almost perfect certainty that you've seen the email address before. That goes for you, and an attacker. If, let's say, you have another way to validate if you've seen the user before (but it's expensive, say, you have an encrypted offline dataset but it's AES-GCM'd and you can't afford to decrypt the entire thing every time), truncation gives you a neat way to _probabilistically_ say if you've seen an address before.

1 comments

> You can fix that problem by also adding a PRF (pseudorandom function) that you rate-limit vigorously. Think of a PRF as a keyed hash -- the usual example is HMAC-SHA256. If you're capable of keeping PRF key material safe but might leak a database dump (not unreasonable), the PRF forces the attack to be online: an attacker can only validate guesses as long as they have access to the PRF, and the PRF comes with audit trails and rate limits.

That particular part is assuming security through not knowing the implementation of the security models components, aka. security through obscurity. Rule no. 1 in security, always assume that the adversary knows exactly how everything is implemented and can do that for himself.

What? A PRF having a secret key is not “security by obscurity.” I am documenting how the entire process works and where the security properties come from.