by Bartweiss 2328 days ago

And even the ones who do practice decent anonymization are generally contributing to the problem just by holding a lot of data.

Lots of companies are content to stop at "our data can't be linked back to a person's identity", which doesn't prevent building a uniquely identifying user profile (e.g. via browser fingerprinting, plus enough metadata to associate a user's computer and phone accounts). Even if they do better than that, it's typically "our data is not uniquely identifying in isolation", which still isn't enough. If your differential privacy model says that these four pieces of data have a specificity of 10,000 possible individuals, that's a good start. But if someone with an individual's PII and three of those keys comes looking, they can still narrow down information about the fourth value from your aggregates.
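To make the "three keys plus your aggregates" case concrete, here's a rough sketch. Everything in it is made up for illustration: the `AGGREGATES` table, the key names, and the `infer_fourth` helper are all hypothetical, and a real release would be far larger. The point is only that a cell can be "not uniquely identifying" and still leak a lot about the fourth value.

```python
from collections import Counter

# Hypothetical published aggregate: counts per (zip3, age_band, sex, condition).
# No cell identifies one person, and no names or IDs are included.
AGGREGATES = {
    ("941", "30-39", "F", "asthma"): 2,
    ("941", "30-39", "F", "diabetes"): 1,
    ("941", "30-39", "F", "none"): 17,
    ("941", "40-49", "M", "none"): 25,
    ("100", "30-39", "F", "asthma"): 9,
}

def infer_fourth(aggregates, known):
    """Distribution of the fourth key among cells matching three known keys."""
    matches = Counter()
    for (*keys, fourth), count in aggregates.items():
        if tuple(keys) == known:
            matches[fourth] += count
    total = sum(matches.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in matches.items()}

# The attacker already knows the target's zip3, age band, and sex from PII.
posterior = infer_fourth(AGGREGATES, ("941", "30-39", "F"))
print(posterior)
```

The target is still hidden among 20 people, but the attacker now holds a fairly sharp estimate of the one value they didn't already know.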

And even if no one screws up, what happens when someone queries a half dozen differentially private datasets, each covering a different subset of a uniquely identifying key? It's something like the file-drawer problem, where one researcher hiding bad data is malicious, but a dozen studies failing to coordinate produce the same result innocently. If outright failures to anonymize become rarer, cross-dataset approaches become more rewarding.
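A toy version of the cross-dataset problem, under some loud assumptions: the population, the field names, and the shared `token` column (standing in for whatever stable pseudonymous ID or fingerprint lets records be lined up across releases) are all hypothetical. Each release on its own keeps every record in a group of at least two; joining them does not.

```python
from collections import Counter
from itertools import product

# Hypothetical population: one person per combination of three coarse attributes.
FIELDS = ("zip3", "birth_decade", "sex")
POPULATION = [
    dict(zip(FIELDS, combo), token=f"t{i}")
    for i, combo in enumerate(
        product(("941", "100"), ("1980s", "1990s"), ("F", "M"))
    )
]

def release(population, fields):
    """One 'anonymized' dataset: the shared token plus only the chosen fields."""
    return {p["token"]: tuple(p[f] for f in fields) for p in population}

def smallest_group(rows):
    """Size of the smallest set of records sharing identical attribute values."""
    return min(Counter(rows.values()).values())

# Three independent releases, each exposing a different subset of the key.
releases = [
    release(POPULATION, ("zip3", "birth_decade")),
    release(POPULATION, ("birth_decade", "sex")),
    release(POPULATION, ("zip3", "sex")),
]

# Join on the token (assumed linkable), concatenating each record's attributes.
joined = {token: sum((r[token] for r in releases), ()) for token in releases[0]}

per_release = [smallest_group(r) for r in releases]
after_join = smallest_group(joined)
print(per_release, after_join)
```

Nobody in this sketch did anything wrong in isolation; the unique key only exists once the releases are combined.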