Hacker News new | ask | show | jobs
by lvh 2949 days ago
You're right, but there are safer constructions to do this. Maybe this kind of knowledge will get more popular now that GDPR is mandating it :)

Active concern for me: GDPR will promote a bunch more homegrown looks-fine-but-actually-busted crypto schemes. I don't think GDPR will be used to enforce that even in the case of breach, and I'm not sure it should -- I think we should make better schemes available instead.

2 comments

What is the 'safer construction' to do this? I'm looking for ideas and trying to solve a problem. My understanding of the GDPR, which is very basic, supports the view that hashing email addresses is at least questionable. On the other hand, if an email list is a core function, de-spamming seems valid.
I posted a comment with an alternative construction plus rationale: https://news.ycombinator.com/item?id=17153329
An appropriately tuned bloom filter would probably suffice.
A Bloom filter is an interesting approach, but the problem is that the attacker and you need the same property: to know if an email is in the set. If you could tell set membership with (effectively) perfect accuracy the Bloom filter may improve performance but not privacy.

I posted an alternative construction elsewhere in the thread.

The difference is that you may be willing to accept a much higher false-positive rate than your attacker can. This is the same idea behind the old "flip a coin, and then raise your hand if either the coin came up heads or you have [embarrassing problem]" method to statistically count everyone with the embarrassing problem, without disclosing anyone's status with certainty. That's the same property your truncated hash achieves.

A Bloom filter could also be designed accordingly. I'm guessing this post's grandparent was thinking of the filter's natural false-positive rate, or you could add deliberate noise.

If had a service where I wanted people to use it and only remember explicitly opt-out, I wouldn't want any false positives to the "have already opted out" question.
Let's say my list has 10^4 members, and there are 10^9 people worldwide. If I design for a 10^-4 false positive rate, then a list constructed by reverse-engineering my algorithm (whether it's a Bloom filter or a truncated hash or anything else) will be 91% false positives, 9% true positives. That's not a huge improvement, but I could imagine applications where someone judged it worth the ~one customer I inconvenience.

This raises fun questions of what it means to disclose a fact, when you're disclosing it probabilistically. Let's say that you tell me the yes/no answer to a question you consider private. I then generate a uniform random number X on [0, 1], and disclose (("you told me yes") || (X >= a)) for some agreed constant a.

If a = 1, then I've almost surely just disclosed your secret. If a = 0, then I've almost surely disclosed nothing. At what value of a do you start to care? That's a really messy question, depending on the social consequences of the information being disclosed (what fraction of innocent candidates would you reject to make sure your child's tutor isn't on the list of clients of a psychologist known for treating pedophiles?), and the other public information about you and about my population that an attacker can fuse to make a stronger estimate.

I don't think privacy-through-false-positives is a terribly effective tool. It's just the only possible tool for creating privacy when your rule is public (whether deliberately or after a breach)--so it's interesting to think about places where it could have some benefit.

Using a hash function for this is about as good as it's going to be. If we accept the conjecture of the existence of one-way functions, and use a cryptographic sound one-way function and implementation, it's provably the best we can do.
That's not provably the case at all. We can absolutely do better, for reasons the root of this thread raises. If that were true, why isn't SHA256 the state of the art in password storage? Reason: the input space is small enough to enumerate.

I posted a comment with an alternative construction plus rationale: https://news.ycombinator.com/item?id=17153329

I never said anything about SHA256. I talked about one-way functions.

The thread model is an adversary that gets unlimited access to the values stored for this purpose, and knows the function used to compute it. He wants to check if a given email is in the set. One-way functions is provably the best way to be able to ask yes/no to the question if this email is in the set with no false answers. I have not said anything about using computationally expensive one-way functions because that does not matter if the function takes 10 seconds to compute. He already knows what emails he wants to check.

You did say hash function.

Would you mind jotting down that proof?

A proper cryptographic hash function is a one way function, if they exist.

But I'd frame your question the other way around then. You do not want to store the emails in a form that leaks any data. For that we need a compressing function. My (unwritten) assumption was that if the adversary compromise the system to get the data, they'll get any secrets too. This means that a HMAC is no better than a cryptographic hash function.

I know that is quite possible to create a system where this would be significantly harder than just a DB dump. But that is both significantly more difficult, and expensive. I'll admit that the formulation "provable the best we can do" should've had a big fat asterisk with the disclaimer about the threat model.

So, if an attacker have the data set and secrets, and wants to compute if a particular input is a member of this set. Can you do better than a cryptographic hash function?

Can you write down in pseudocode exactly what you're suggesting? As in db.write(sha256(email)) or whatever.