Maybe, but they have a good reason to keep that data, and they even go out of their way to "hide it" the best they can using a one-way function.
To save the information that a certain email address has explicitly withdrawn consent, they need to store it. The alternative is to send out a new email the next time someone adds then. I think the interpretation of GDPR this particular instance of information storing is still open, but they have done everything possible to keep it safe. Should the list of hashes be leaked, the best an adversary can realistically do is check known emails against the list of hashes.
You're right, but there are safer constructions to do this. Maybe this kind of knowledge will get more popular now that GDPR is mandating it :)
Active concern for me: GDPR will promote a bunch more homegrown looks-fine-but-actually-busted crypto schemes. I don't think GDPR will be used to enforce that even in the case of breach, and I'm not sure it should -- I think we should make better schemes available instead.
What is the 'safer construction' to do this? I'm looking for ideas and trying to solve a problem. My understanding of the GDPR, which is very basic, supports the view that hashing email addresses is at least questionable. On the other hand, if an email list is a core function, de-spamming seems valid.
A Bloom filter is an interesting approach, but the problem is that the attacker and you need the same property: to know if an email is in the set. If you could tell set membership with (effectively) perfect accuracy the Bloom filter may improve performance but not privacy.
I posted an alternative construction elsewhere in the thread.
The difference is that you may be willing to accept a much higher false-positive rate than your attacker can. This is the same idea behind the old "flip a coin, and then raise your hand if either the coin came up heads or you have [embarrassing problem]" method to statistically count everyone with the embarrassing problem, without disclosing anyone's status with certainty. That's the same property your truncated hash achieves.
A Bloom filter could also be designed accordingly. I'm guessing this post's grandparent was thinking of the filter's natural false-positive rate, or you could add deliberate noise.
Using a hash function for this is about as good as it's going to be. If we accept the conjecture of the existence of one-way functions, and use a cryptographic sound one-way function and implementation, it's provably the best we can do.
That's not provably the case at all. We can absolutely do better, for reasons the root of this thread raises. If that were true, why isn't SHA256 the state of the art in password storage? Reason: the input space is small enough to enumerate.
I never said anything about SHA256. I talked about one-way functions.
The thread model is an adversary that gets unlimited access to the values stored for this purpose, and knows the function used to compute it. He wants to check if a given email is in the set. One-way functions is provably the best way to be able to ask yes/no to the question if this email is in the set with no false answers. I have not said anything about using computationally expensive one-way functions because that does not matter if the function takes 10 seconds to compute. He already knows what emails he wants to check.
It seems to me this counts only as pseudonymisation and not anonymization. While the hash is not directly readable, it's still reversible with additional information (such as a large list of email addresses and knowledge of the hashing algorithm).
One of the GDPR notes says:
> [p]ersonal data which have undergone pseudonymization, which could be attributed to a natural person by the use of additional information, should be considered to be information on an identifiable natural person”
Consider that you are running some kind of controversial/embarassing site of sexual/political/other sensitive nature. You keep a hashes of people who once were users but unsubscribed or something like that.
If that database is leaked, a user could re-hash list of political figures, celebrities or just some big list of well known email addresses and with this information find out they were users of this sensitive site.
So to me it seems that pseudoanymized/hashed emails still count as PII and have to be treated as such.
It cannot be reversed without a significant amount of effort (really, even when you say the "entropy is quite small" it's not actually as small as you would think) and is therefore probably reasonable. Worst case a regulating body will tell you that no, they do not think "this will take 1-10 years to reverse" is quite good enough and then you can work with them on a solution that would be good enough.
Could you walk me through how you come to that conclusion? I admit my estimate was very ballpark, but "minutes" seems so wildly out of line with what I think I must be making a mistake somewhere.
A single AWS GPU server can hash trial passwords on the order of 100 GH/s, which puts a pretty low ceiling on "hashcat as a service" rental costs.
I'm assuming 10^12 tries per second is economical for any business.
there are about a million words, including all likely spellings of all but the rarest first and last names, so all 1 or 2 word addresses, firstname.lastnames, etc. addresses are about 10^12. try those, plus short alphanumerics, for the 1000 most common email domains -> 10^15 addresses
Throw in every name in public leak databases that doesn't meet those patterns as well.
There's on the order of 1 million domains that are likely to be serving mail at all; try the billion most likely names for each of those for another 10^15.
This should capture almost every email address that isn't an intentionally obfuscated one-off and adds up to less than an hour at 10^12/sec. There's a modest overhead to matching against a larger list but it shouldn't matter in practice
The practical entropy of email addresses is indeed pretty small, lots of them are going to be first.last@company.example and a bunch more end in gmail.com or another popular provider.
If you can accept some level of false positives you could make the hash too narrow to be able to usefully reverse it. For example if only sixty people will ever subscribe or refuse to subscribe,a 24-bit hash is plenty to reject mistaken attempts to subscriber somebody who doesn't want in, but good luck guessing which GMail user is "2ca24b".
Another problem is, what if the email address changes hands - maybe even the whole email domain changed ownership. You probably need a way for people to change their minds, as that then also covers the case where the person behind the address changed.
Yes. They don't quite define how the hash works in the post, but assuming it's something like SHA256(email), that's easy to enumerate.
There are ways to do this better. Let's say that it's 1 party and you're trying to figure out if you've seen en email address before. (That's the case in the article, there are also schemes where you and another entity can figure out if you both saw any email addresses -- but that's not what we're discussing here.)
We already know how to take relatively low entropy things and store them securely to see if you've seen them before, for password storage! However, password storage works a little differently. You _know_ which entry you're checking against because you have a secret (password) but also an identifier (user name) -- so you can recompute against the same random key. This randomization means attackers need to try every password for every user. This doesn't work for us, because we just have an email, but it's close.
Three parts worth considering: KDF, PRF and truncation. Firstly, your (deterministic, for reasons mentioned above) PRF turns your low-entropy input into a higher-entropy key. But (again, for reasons mentioned above) attackers still just have to try every email should they compromise your database. You can fix that problem by also adding a PRF (pseudorandom function) that you rate-limit vigorously. Think of a PRF as a keyed hash -- the usual example is HMAC-SHA256. If you're capable of keeping PRF key material safe but might leak a database dump (not unreasonable), the PRF forces the attack to be online: an attacker can only validate guesses as long as they have access to the PRF, and the PRF comes with audit trails and rate limits.
Finally, you can choose to truncate the output. Because the output space of your PRF will be much, much larger than the input space of email addresses, a match out of the PRF gives you almost perfect certainty that you've seen the email address before. That goes for you, and an attacker. If, let's say, you have another way to validate if you've seen the user before (but it's expensive, say, you have an encrypted offline dataset but it's AES-GCM'd and you can't afford to decrypt the entire thing every time), truncation gives you a neat way to _probabilistically_ say if you've seen an address before.
> You can fix that problem by also adding a PRF (pseudorandom function) that you rate-limit vigorously. Think of a PRF as a keyed hash -- the usual example is HMAC-SHA256. If you're capable of keeping PRF key material safe but might leak a database dump (not unreasonable), the PRF forces the attack to be online: an attacker can only validate guesses as long as they have access to the PRF, and the PRF comes with audit trails and rate limits.
That particular part is assuming security through not knowing the implementation of the security models components, aka. security through obscurity. Rule no. 1 in security, always assume that the adversary knows exactly how everything is implemented and can do that for himself.
What? A PRF having a secret key is not “security by obscurity.” I am documenting how the entire process works and where the security properties come from.
I honestly can't see this being something that would ever result in enforcement action.
They have a legitimate business interest in not spamming someone if people try to sign you up multiple times, and since the email address is hashed, all they can use it for is to determine if they've sent you an invite before (and potentially when they did so, or when you declined the invitation).
Maybe they could get in trouble if they also retain information on who is trying to send you invites and creating a graph and a shadow profile based on this type of information, but it sounds pretty clear that this isn't something they're doing or are interested in doing.
To save the information that a certain email address has explicitly withdrawn consent, they need to store it. The alternative is to send out a new email the next time someone adds then. I think the interpretation of GDPR this particular instance of information storing is still open, but they have done everything possible to keep it safe. Should the list of hashes be leaked, the best an adversary can realistically do is check known emails against the list of hashes.