by ThePhysicist 2325 days ago
Most companies still don’t know what anonymization means and confuse anonymized with pseudonymized or masked data.

Part of the problem is that there are still no good criteria available to define anonymity. Concepts like differential privacy are a step in the right direction but they still provide room for error, and in many cases they are either too restrictive (transformed data is not useful anymore) or too lax (transformed data is useful but can be easily re-identified).
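To make the "too restrictive or too lax" trade-off concrete, here is a minimal sketch of the Laplace mechanism on a single count, assuming a sensitivity of 1. The function names and the example count of 120 are made up for illustration and are not from any particular library. A small epsilon drowns the count in noise, while a large epsilon leaves it almost untouched.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale used by the Laplace mechanism: sensitivity / epsilon."""
    return sensitivity / epsilon


def noisy_count(true_count: float, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Return the count plus Laplace noise.

    The difference of two independent exponential draws with the same
    rate is Laplace-distributed with scale 1 / rate.
    """
    scale = laplace_scale(sensitivity, epsilon)
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    """Average absolute distortion of a count of 120 (hypothetical value)."""
    rng = random.Random(seed)
    return sum(abs(noisy_count(120, epsilon, rng) - 120)
               for _ in range(trials)) / trials


for eps in (0.01, 1.0, 100.0):
    print(f"epsilon={eps}: scale={laplace_scale(1.0, eps)}, "
          f"mean abs error={mean_abs_error(eps):.3f}")
```

Picking epsilon is the hard part: the mechanism itself does not say which value is acceptable for a given dataset.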

4 comments

It's not that most of them don't know what anonymization is or are confused about it.

Society is a tapestry of bullshit, and low-level swindling is generally tolerated or quickly forgotten about. Thus, there's nothing to prod the unprincipled in charge to do the right thing. As long as something seems to be good (anonymized, in this case), and problems can be hidden behind the corporate veil long enough, the unwritten rule is to half-ass security solutions because, well, security is boring and there are other things to devote company time and resources to (things that will advance upper management).

Security measures, especially those that protect the users, don't make money. At best, they're insurance against the fallout that might occur when it's revealed that your company has been silently screwing people over. Like most human beings, businesses often put off serious consideration of the future in order to enjoy quick and immediate gain.

I wouldn't put it past most companies to screw up an approach like differential privacy. Not enough people actually care that much.

> Security measures, especially those that protect the users, don't make money.

This is why the government has to make regulations with teeth in this space (of course, the government could be the "unprincipled in charge" you referred to).

> of course, the government could be the "unprincipled in charge" you referred to

Not specifically, but I suppose I wouldn't say that politicians are more or less principled than corporate executives. I know some would argue otherwise, but I'm too black pilled at this point to have faith in any "public servant".

Nevertheless, government regulation is probably the way to actually address these issues. Government may lack competence or will, but at least it provides us some leverage, little though it may be.

And even the ones who do practice decent anonymization are generally contributing to the problem just by holding a lot of data.

Lots of companies are content to stop at "our data can't be linked back to a person's identity", which doesn't prevent building a uniquely-identifying user profile (e.g. via browser fingerprinting, plus enough metadata to associate a user's computer and phone accounts). Even if they do better than that, it's typically "our data is not uniquely identifying in isolation", which still isn't enough. If your differential privacy model says that these four pieces of data have a specificity of 10,000 possible individuals, that's a good start. But if someone with an individual's PII and three of those keys comes looking, they can still narrow down information about the fourth value from your aggregates.
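A toy sketch of that last point, with entirely made-up records and field names (zip, birth year, gender, diagnosis): the published aggregate never names anyone, but an attacker who already knows three of the keys for one person can read off the likely fourth value.

```python
from collections import Counter

# Hypothetical records: (zip, birth_year, gender, diagnosis).
records = [
    ("94110", 1985, "F", "asthma"),
    ("94110", 1985, "F", "asthma"),
    ("94110", 1985, "F", "asthma"),
    ("94110", 1985, "F", "none"),
    ("94110", 1985, "M", "diabetes"),
    ("94110", 1990, "F", "none"),
    ("10001", 1985, "F", "none"),
]

# What gets published: counts per combination, no names or IDs.
aggregates = Counter(records)


def narrow_fourth(aggregates, known_three):
    """Distribution of the fourth value among cells matching three known keys."""
    matches = Counter()
    for (*keys, fourth), count in aggregates.items():
        if tuple(keys) == tuple(known_three):
            matches[fourth] += count
    return matches


# The attacker knows the target's zip, birth year and gender from elsewhere.
result = narrow_fourth(aggregates, ("94110", 1985, "F"))
total = sum(result.values())
for value, count in result.items():
    print(f"{value}: {count}/{total}")
```

In this toy data the target sits in a group of four people, yet the attacker still learns that "asthma" is three times as likely as "none".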

And even if no one screws up, what happens when someone queries a half dozen differential datasets for different subsets of a uniquely identifying key? It's something like the file-drawer problem, where one researcher hiding bad data is malicious, but a dozen studies failing to coordinate produces the same result innocently. If outright failures to anonymize become rarer, cross-dataset approaches become more rewarding.
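A rough illustration of the cross-dataset case, again with an invented population. Each release is keyed on a different pair of attributes and, taken alone, leaves the target hidden in a group of two. The person IDs here only stand in for "rows the attacker cannot tell apart within one release", which is a simplifying assumption.

```python
# Hypothetical population; in practice the attacker never sees this table,
# only the groups each release lets them narrow the target down to.
population = {
    1: {"zip": "94110", "year": 1985, "gender": "F"},
    2: {"zip": "94110", "year": 1985, "gender": "M"},
    3: {"zip": "94110", "year": 1990, "gender": "F"},
    4: {"zip": "10001", "year": 1985, "gender": "F"},
    5: {"zip": "10001", "year": 1990, "gender": "M"},
}


def candidates(population, target, fields):
    """People indistinguishable from the target on the given fields."""
    return {
        pid for pid, row in population.items()
        if all(row[f] == target[f] for f in fields)
    }


target = population[1]

# Three separate releases, each keyed on a different subset of attributes.
releases = [("zip", "year"), ("year", "gender"), ("zip", "gender")]

per_release = [candidates(population, target, fields) for fields in releases]
combined = set.intersection(*per_release)

for fields, group in zip(releases, per_release):
    print(fields, "->", sorted(group))
print("combined ->", sorted(combined))
```

No single release did anything wrong by its own standard; the re-identification only appears once the groups are intersected.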

As one step to raise awareness about the differences I really like this overview:

https://fpf.org/wp-content/uploads/2017/06/FPF_Visual-Guide-...

Having read about anonymization techniques, I have started to believe that the definitions of anonymity and pseudonymity are well settled by now, but the criteria that define the invariants a data transformation must preserve are not. As a result, these criteria fail to guide the implementations of the transformations.

You keep data because data is economically valuable, but even when you care enough to implement some techniques that depend on the invariants, you still fail to achieve something better because of scale and because you don't want to refine the techniques. This also means that somebody may have a technique that, provided enough pieces of data, can reverse your transformation.