Hacker News
by privacylawthrow 2026 days ago
For data to be anonymous under GDPR, it is not enough that individuals cannot be identified from the anonymized data set. If individuals can be identified when the anonymous data set is compared with the source data set, the anonymized data is not "anonymous".

For data to be truly anonymous under GDPR, there must be no additional data that would allow for reidentification. If there is any other data that, when combined with the anonymous data, allows for reidentification, the data set is only pseudonymous and must be treated as personal data under GDPR.

6 comments

Can someone explain the point of this requirement? If a malicious actor has access to the source data there's no need to compare it to anonymized data. What am I missing?
It's an easy-to-state, largely foolproof test to see if data really is anonymized.

The thing that you're worried about with poorly-anonymized datasets is that if you have another non-anonymized dataset you can combine them to deduce the original information. "Your data set must not be combinable with any other data set that would allow someone to infer the original data" is a hard requirement to verify. How could you possibly test them all?

Well it turns out that there is one such non-anonymized dataset with the property that if you can't connect your anonymized data with it at all then you can be pretty sure that you couldn't connect them with any others -- the original data!
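
To make the test concrete, here is a minimal sketch of checking a released data set against its own source. The records, the field names, and the `reidentifiable` helper are all made up for illustration, and it assumes an exact-match join is enough to count as a link.

```python
from collections import Counter

def reidentifiable(source, released, shared_fields):
    """Return released rows that match exactly one source row on shared_fields."""
    def key(row):
        return tuple(row[f] for f in shared_fields)
    # How many source rows share each combination of values.
    counts = Counter(key(row) for row in source)
    # A released row is linkable when its combination is unique in the source.
    return [row for row in released if counts[key(row)] == 1]

# Hypothetical source data (invented for this example).
source = [
    {"name": "Alice", "zip": "10001", "birth_year": 1980, "diagnosis": "flu"},
    {"name": "Bob", "zip": "10001", "birth_year": 1980, "diagnosis": "asthma"},
    {"name": "Carol", "zip": "94110", "birth_year": 1975, "diagnosis": "flu"},
]

# "Anonymize" by dropping only the name column.
released = [{k: v for k, v in row.items() if k != "name"} for row in source]

# An outsider who only knows zip and birth year can single out one row.
unique_on_demographics = reidentifiable(source, released, ["zip", "birth_year"])
# Whoever holds the source can compare every remaining column.
unique_on_all = reidentifiable(source, released, ["zip", "birth_year", "diagnosis"])

print(len(unique_on_demographics), len(unique_on_all))
```

With the source in hand, every released row links back to a single person, even though the demographics alone only single out one. That is why passing the comparison against the original is a stronger check than testing against any one outside data set.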

Let's say you're doing a study of fingerprint patterns. You anonymize a collection of fingerprints from a non-anonymized source by stripping everything but the fingerprint images. Because fingerprints are unique, it seems like it'd be impossible to meet the GDPR criteria; even if the only thing left was the fingerprints, they would be identified when compared against the source dataset. a) Is this interpretation accurate? b) If so, it seems that there are large swaths of data that can never be in compliance. What are the implications for medical research, for instance?
I think you nailed it that some data can’t really be anonymized. How could you anonymize emails, names, social security numbers, DNA samples?

You don’t have to use anonymized data all the time; it’s just that the requirements for handling and passing around such data are lower.

I don't understand the point though; if someone has the source data, what good is the anonymized data to them? What value is added by requiring more stringent safeguards on data that can't be anonymized this way?
If someone makes inferences on the de-identified data, or joins it against another dataset, the source dataset lets those inferences or joins be tied back to the original identifying data.

The main point is that de-identified data can still be "personal", so it's regulated. If you share or make public pseudonymous data, that data is still covered by GDPR, so you have to inform the individuals, have a legal basis (such as consent), let them opt out (if applicable), etc. Even if it's been pseudonymized, I would want to know if/when my data is sold to a marketing firm or whatever.

> The source dataset lets those inferences or joins be tied back to the original identifying data.

But if the attacker lacks the source dataset, they can't do this, and if they possess the source dataset, they'd use it for their analysis rather than using the anonymised dataset.

The point is that if the attacker can connect your user record in the source data with user # 188da24a7789d in the "anonymized" data, they can use that to re-identify all information derived from or built on the "anonymized" data.

Oh, there is a Netflix account for user # 188da24a7789d and the IRS released tax summaries for user # 188da24a7789d? That's interesting, since I know that user # 188da24a7789d is really MaxBarraclough.
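
A toy illustration of that linkage, assuming the pseudonym is reused unchanged across releases. The data set contents and the `reidentify` helper are invented for the example; only the pseudonym and the username come from the comment above.

```python
# One link from the source data: pseudonym -> real identity.
link_table = {"188da24a7789d": "MaxBarraclough"}

# Two hypothetical "anonymized" releases keyed by the same pseudonym.
viewing_history = [{"user": "188da24a7789d", "watched": "Documentary A"}]
tax_summaries = [{"user": "188da24a7789d", "bracket": "B"}]

def reidentify(link_table, *datasets):
    """Merge every row whose pseudonym appears in link_table under the real name."""
    profiles = {}
    for dataset in datasets:
        for row in dataset:
            name = link_table.get(row["user"])
            if name is None:
                continue  # pseudonym not known to the attacker
            profile = profiles.setdefault(name, {})
            # Copy every attribute except the pseudonym itself.
            profile.update({k: v for k, v in row.items() if k != "user"})
    return profiles

profiles = reidentify(link_table, viewing_history, tax_summaries)
print(profiles)
```

A single known mapping is enough to attach a name to every downstream data set that reuses the pseudonym, including ones the holder of the source data never had before.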

Suppose a dataset removes all information except for, say, a user's fingerprints, so the only information stored in the anonymous dataset is an image of a fingerprint. The nature of fingerprints prevents them from meeting this requirement, as stated, which effectively eliminates any research that can be done with the data. Given that the only way the dataset could be linked to the original user is if an attacker already had access to the source data, how is this regulation benefiting anyone?
Privacy and security are not the same. Security is about protecting against malicious actors. Privacy is about protecting data from everyone who is not the person the PII describes.
Google's Pair group has a great explainer here: https://pair.withgoogle.com/explorables/anonymization/
That's the most concise and clear formulation of that I've seen so far. Thanks.
k-anonymity is often applied only to "pseudoidentifiers" (quasi-identifiers). If you have the original dataset, it would be trivial to reverse k-anonymity applied that way. For example, someone's blood pressure isn't considered an identifying variable and would not need to be anonymised (and should not be, to keep data utility high); however, leaving it untouched makes linking against the original dataset trivial.
You are right: time series data like BPM over time does not lend itself nicely to anonymization. The provider will most likely have to ask the user organizations what kind of measures (features) they need and return an aggregate, such as an average (if that's what the receiving organisation was after), that can itself be k-anonymized.
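
A rough sketch of that failure mode, with invented records and a deliberately crude generalization scheme (age decades, truncated zip codes). The `k_of` and `generalize` helpers are hypothetical, and the linking step assumes the blood pressure readings happen to be unique in the source.

```python
from collections import Counter

def k_of(rows, quasi_identifiers):
    """Smallest group size when rows are grouped by the quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

# Hypothetical source data (invented for this example).
source = [
    {"name": "Alice", "age": 34, "zip": "10001", "systolic": 118},
    {"name": "Bob", "age": 36, "zip": "10002", "systolic": 131},
    {"name": "Carol", "age": 52, "zip": "94110", "systolic": 142},
    {"name": "Dave", "age": 57, "zip": "94117", "systolic": 125},
]

def generalize(row):
    """Coarsen the quasi-identifiers; leave the measurement untouched."""
    decade = row["age"] // 10 * 10
    return {
        "age": f"{decade}-{decade + 9}",
        "zip": row["zip"][:3] + "**",
        "systolic": row["systolic"],
    }

released = [generalize(r) for r in source]
k = k_of(released, ["age", "zip"])  # looks 2-anonymous on paper

# Anyone holding the source can link on the untouched measurement instead.
by_bp = {r["systolic"]: r["name"] for r in source}  # assumes unique readings
linked = [by_bp[r["systolic"]] for r in released]

print(k, linked)
```

The release satisfies k = 2 on the generalized columns, yet every row maps back to a name through the one column that was left alone for the sake of utility.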
Averaged time series are very different than individual ones.

This is a deep problem; it's basically unavoidable in e.g. medical research - the very factors you want to study may well be potentially identifying. The only way to address this is to balance the potential utility of the research against the potential impact of the information.

In my experience, this is a question of interpretation (see e.g. Recital 26 and the question of what is "reasonably likely"). You can ask ten different experts, and you will get ten different opinions.

Unfortunately, many aspects of the GDPR are interpreted very heterogeneously, both in individual countries and by different supervisory authorities within the countries themselves.

For this reason, it is essential that more specific guidelines and certifications are developed for the use of different technologies, including anonymization.

> In my experience, this is a question of interpretation (see e.g. Recital 26 and the question of what is "reasonably likely").

This is absolutely true. The hard part is that what is "reasonably likely" changes as technology changes. It's entirely possible that a data set that qualifies as anonymous today will not be anonymous in 5 years. Organizations are responsible for the data they publish. If data loses its anonymity in the future due to the release of other data sets and/or improved technology, the organization releasing the data will be responsible for the release of personal data, even if it wasn't personal data at the time of release.

True. For this reason, even anonymous data can usually not be shared as open data. You have to control the environment in which the data is used to control what is "reasonably likely" (see also comment by La1n above).
Also this interpretation would completely block any sharing within the pharmaceutical field, where the original data is required by law to be kept for a minimum of 25 years. I personally like the definitions from UKAN, which talk about anonymous data as relating to data environments.

edit: https://msrbcel.files.wordpress.com/2020/11/adf-2nd-edition-...

Absolutely. They're doing a great job at UKAN!
Although what you say makes sense, which is the respective GDPR rule? I don’t recall seeing something like this.