by ThePhysicist 3103 days ago
Personally I wouldn't be that pessimistic about data anonymization. It's entirely possible to robustly anonymize low-dimensional data sets and bound the information gain of an attacker by a given value, even when he/she has information about all non-sensitive attributes in the data set. When using e.g. k-anonymity (with additional l-diversity or better and t-closeness criteria) the resulting data is very robust against attacks, provided you correctly specify your sensitive attributes. Of course there are more things to keep in mind, e.g. when repeatedly anonymizing different versions of the same data set (as this can cause data leakage).
3 comments

K-anonymity provides very little protection, if any. A few brief points:

1. I've never seen a formal definition of security that k-anon supposedly satisfies. While I personally really like formal guarantees, maybe one might argue this wouldn't be so bad absent concrete problems with the definition. Which leads us to...

2. K-anon doesn't compose. The JOIN of two databases, each k-anonymized, can be 1-anonymous (i.e., no anonymity), no matter what k is.
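To make point 2 concrete, here is a minimal sketch of one way composition can fail. Everything in it is made up for illustration: two hypothetical releases (`RELEASE_A`, `RELEASE_B`) that are each 2-anonymous, and an attacker who knows the target's exact age and zip code and that the target appears in both releases with the same diagnosis.

```python
from collections import Counter

# Hypothetical toy data: two sources each publish a 2-anonymous table.
# Quasi-identifiers are generalized to an (age_low, age_high) range and a zip prefix.
RELEASE_A = [
    {"age": (30, 39), "zip": "130", "diagnosis": "flu"},
    {"age": (30, 39), "zip": "130", "diagnosis": "HIV"},
    {"age": (40, 49), "zip": "148", "diagnosis": "cancer"},
    {"age": (40, 49), "zip": "148", "diagnosis": "flu"},
]
RELEASE_B = [
    {"age": (20, 33), "zip": "1305", "diagnosis": "flu"},
    {"age": (20, 33), "zip": "1305", "diagnosis": "asthma"},
    {"age": (34, 37), "zip": "1305", "diagnosis": "HIV"},
    {"age": (34, 37), "zip": "1305", "diagnosis": "cancer"},
]


def smallest_group(release):
    """Size of the smallest quasi-identifier group, i.e. the k of the release."""
    counts = Counter((row["age"], row["zip"]) for row in release)
    return min(counts.values())


def candidates(release, age, zip_code):
    """Diagnoses of all rows whose generalized quasi-identifiers match the target."""
    return {
        row["diagnosis"]
        for row in release
        if row["age"][0] <= age <= row["age"][1] and zip_code.startswith(row["zip"])
    }


# Assumed attacker knowledge: the target is 34 years old and lives in zip 13053.
TARGET_AGE, TARGET_ZIP = 34, "13053"

from_a = candidates(RELEASE_A, TARGET_AGE, TARGET_ZIP)  # two candidates
from_b = candidates(RELEASE_B, TARGET_AGE, TARGET_ZIP)  # two candidates
combined = from_a & from_b  # only values consistent with both releases

print(smallest_group(RELEASE_A), smallest_group(RELEASE_B), combined)
```

Each release on its own leaves two candidate diagnoses, but only one value is consistent with both.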

3. The distinction between quasi-identifiers and sensitive attributes (central to the whole framework) is worse than meaningless: it is misleading. Every sensitive attribute is a quasi-identifier given the right auxiliary datasets. Using k-anon essentially requires one to determine a priori which additional datasets will be used when attacking the k-anonymized dataset.

4. My understanding of modified versions (diversity, closeness, etc) is less developed, but I believe they suffer similar weaknesses. The weaknesses are obscured by the additional definitional complexity.

(Edit: typos and autocorrect)

1. As I said, most people don't use plain k-anonymity, as it can leak information about the sensitive attribute when the values of this attribute in a group are (almost) all the same. This is why extensions like l-diversity and t-closeness exist: l-diversity ensures that in each group there will be at least l different values of the sensitive attribute, while t-closeness ensures that the resulting distribution of the sensitive attribute values in a group is close (as e.g. measured by the "earth mover's distance") to the distribution of the sensitive attribute in the entire dataset. Given the original data and the anonymized data sets, it's pretty easy to measure the information gain (e.g. using a Bayesian approach) of an attacker if he/she knows in which group a given person is. In that sense k-anonymity (with l-diversity/t-closeness) can be analyzed in a formal context just like e.g. differential privacy.
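As a rough illustration of these three criteria, here is a minimal sketch that measures them on a made-up table. The rows, column names and helper functions are all hypothetical. l-diversity is taken in its simplest "distinct values" form, and the distance for t-closeness is the total variation distance, which is what the earth mover's distance reduces to if one assumes all categories are equally far apart.

```python
from collections import Counter, defaultdict

# Hypothetical, already-generalized table.
ROWS = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "HIV"},
    {"zip": "148**", "age": "30-39", "diagnosis": "cancer"},
]
QUASI_IDS = ("zip", "age")
SENSITIVE = "diagnosis"


def groups_by(rows, quasi_ids):
    """Split rows into groups that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)].append(row)
    return list(groups.values())


def distribution(rows, sensitive):
    """Relative frequency of each sensitive value in rows."""
    counts = Counter(row[sensitive] for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}


def k_of(rows, quasi_ids):
    """Largest k for which the table is k-anonymous (smallest group size)."""
    return min(len(group) for group in groups_by(rows, quasi_ids))


def l_of(rows, quasi_ids, sensitive):
    """Largest l for distinct l-diversity (fewest distinct values in any group)."""
    return min(len(distribution(g, sensitive)) for g in groups_by(rows, quasi_ids))


def t_of(rows, quasi_ids, sensitive):
    """Smallest t for t-closeness under total variation distance (an assumption)."""
    overall = distribution(rows, sensitive)
    worst = 0.0
    for group in groups_by(rows, quasi_ids):
        local = distribution(group, sensitive)
        distance = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, distance)
    return worst


k = k_of(ROWS, QUASI_IDS)
l = l_of(ROWS, QUASI_IDS, SENSITIVE)
t = t_of(ROWS, QUASI_IDS, SENSITIVE)
print(k, l, round(t, 3))
```

The table is 3-anonymous, yet the first group is only 1-diverse: anyone known to be in it has the flu, which is exactly the leak that plain k-anonymity does not catch.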

2. Yes, that's what I mentioned at the end; k-anonymity is no different from most other techniques here: if you use differential privacy with the Laplace mechanism and repeatedly publish independently anonymized versions of the same underlying data, you will leak information (as an attacker will be able to average the released values in order to get an estimate of the true value).
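The averaging attack can be sketched in a few lines. The true value, the noise scale and the number of releases below are made-up numbers, and the Laplace noise is sampled as the difference of two exponential draws using only the standard library.

```python
import random


def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release(true_value, scale, rng):
    """One independently noised release of the same underlying value."""
    return true_value + laplace_noise(rng, scale)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
TRUE_VALUE = 100.0      # hypothetical secret statistic
SCALE = 10.0            # sensitivity / epsilon, e.g. sensitivity 1 and epsilon 0.1

# The attacker collects many independent releases and averages them.
releases = [release(TRUE_VALUE, SCALE, rng) for _ in range(10_000)]
estimate = sum(releases) / len(releases)
averaged_error = abs(estimate - TRUE_VALUE)

print(abs(releases[0] - TRUE_VALUE), averaged_error)
```

A single release is typically off by an amount on the order of the scale, while the error of the average shrinks roughly with the square root of the number of releases. In differential privacy terms this is accounted for by composition: each independent release spends additional privacy budget.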

3. Yes, sensitive attributes are often quasi-identifiers as well (at least in combination with other quasi-identifiers). They are treated differently because the underlying risk model does not regard a (non-sensitive) quasi-identifier as something that needs to be protected. Inferring e.g. your gender from your zip code, age and body weight using an anonymized data set is (usually) not considered problematic, whereas learning that you are HIV-positive would (almost always) be problematic, hence the distinction. Also, sensitive attributes are treated as a group when applying k-anonymity, i.e. if we have two binary attributes (HIV, Syphilis) one applies the anonymization criteria to the combinations of the attributes ((true, true), (false, true), (true, false), (false, false)), not individually to each attribute (as this can cause information leakage).

4. I honestly don't know what to reply to this, as l-diversity/t-closeness are well specified methods that were designed to overcome the (known) limitations of k-anonymity. Yes, these methods are not completely trivial to use, but if used correctly they can provide good and quantifiable protection. Not using them since they are hard to implement correctly is like saying we shouldn't use cryptographic algorithms like RSA because it's hard to get all the implementation details right.

You are correct in theory, but there are so many conditions in your answer that you have really proven GP's point.

Even if you make sure that information gain about an individual from your dataset is minimal, this could easily change if combined with other data sets, as GP stated.

Saying we shouldn't use these techniques because it's hard to implement them correctly is like saying we shouldn't use cryptographic methods like RSA because they are hard to understand. You don't need to roll your own version of RSA to use it, and you don't need to implement your own k-anonymity/l-diversity/t-closeness implementation to anonymize your data (see e.g. [1] for an open-source tool to do this).

[1]: http://arx.deidentifier.org/

Is there any entity in the known universe that has the right incentives, is not completely incompetent, and is willing to spend enough resources on this, all while not making any mistakes?

Anyone claiming to do this needs to be verified, that means that it has to be open. And being open does not by any stretch imply that it has been verified. And I will not do that just to use your product/site.

Bottom line: Just abandon and ignore anyone claiming to anonymize sensitive data.