Hacker News new | ask | show | jobs
by zwily 2949 days ago
An appropriately tuned bloom filter would probably suffice.
1 comments

A Bloom filter is an interesting approach, but the problem is that the attacker and you need the same property: to know if an email is in the set. If you could tell set membership with (effectively) perfect accuracy the Bloom filter may improve performance but not privacy.

I posted an alternative construction elsewhere in the thread.

The difference is that you may be willing to accept a much higher false-positive rate than your attacker can. This is the same idea behind the old "flip a coin, and then raise your hand if either the coin came up heads or you have [embarrassing problem]" method to statistically count everyone with the embarrassing problem, without disclosing anyone's status with certainty. That's the same property your truncated hash achieves.

A Bloom filter could also be designed accordingly. I'm guessing this post's grandparent was thinking of the filter's natural false-positive rate, or you could add deliberate noise.

If had a service where I wanted people to use it and only remember explicitly opt-out, I wouldn't want any false positives to the "have already opted out" question.
Let's say my list has 10^4 members, and there are 10^9 people worldwide. If I design for a 10^-4 false positive rate, then a list constructed by reverse-engineering my algorithm (whether it's a Bloom filter or a truncated hash or anything else) will be 91% false positives, 9% true positives. That's not a huge improvement, but I could imagine applications where someone judged it worth the ~one customer I inconvenience.

This raises fun questions of what it means to disclose a fact, when you're disclosing it probabilistically. Let's say that you tell me the yes/no answer to a question you consider private. I then generate a uniform random number X on [0, 1], and disclose (("you told me yes") || (X >= a)) for some agreed constant a.

If a = 1, then I've almost surely just disclosed your secret. If a = 0, then I've almost surely disclosed nothing. At what value of a do you start to care? That's a really messy question, depending on the social consequences of the information being disclosed (what fraction of innocent candidates would you reject to make sure your child's tutor isn't on the list of clients of a psychologist known for treating pedophiles?), and the other public information about you and about my population that an attacker can fuse to make a stronger estimate.

I don't think privacy-through-false-positives is a terribly effective tool. It's just the only possible tool for creating privacy when your rule is public (whether deliberately or after a breach)--so it's interesting to think about places where it could have some benefit.