by skierscott 3184 days ago
I work with data scientists and it’s scary how much they have access to.

One first-hand story: this person searched a database to find the tax records and income of a specific individual. The records were de-anonymized but there’s enough data there that he could figure it out.

We should design systems to preserve individual privacy. The person I heard the story from had no business in this person’s tax records.

3 comments

Can you clarify this term - “de-anonymized”? It reads like the word “anonymised” is what’s meant.
"de-personalized" or "anonymized" would both work as valid interpretations in that paragraph, in case that helps you gain understanding while waiting for a reply.
Anonymization is a process that is supposed to protect privacy by removing directly identifying information from a data set. De-anonymization is the reverse process (identifying someone in a data set that has no directly identifying information), which proves that anonymization does not work.

For example, take phone location records: just look at where the phone is every night of the week and where it goes every morning, and you have the home and work addresses of the owner, which is more than enough to identify the phone's owner in pretty much all cases.
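To make the location example concrete, here is a rough sketch of the idea in Python. Everything in it is an assumption for illustration: the function name `infer_home_and_work`, the night and working-hour windows, the grid size, and the sample coordinates are all made up, and a real data set would need far more careful handling.

```python
from collections import Counter
from datetime import datetime


def infer_home_and_work(pings, precision=3):
    """Guess home and work grid cells from (datetime, lat, lon) pings.

    Hypothetical helper: the hour windows and the rounding precision
    are illustrative assumptions, not values from any real system.
    """
    night = Counter()
    workday = Counter()
    for ts, lat, lon in pings:
        # Snap each ping to a coarse grid cell so nearby points group together.
        cell = (round(lat, precision), round(lon, precision))
        if ts.hour >= 22 or ts.hour < 6:
            # Late evening and early morning: probably at home.
            night[cell] += 1
        elif ts.weekday() < 5 and 9 <= ts.hour < 17:
            # Weekday office hours: probably at work.
            workday[cell] += 1
    home = night.most_common(1)[0][0] if night else None
    work = workday.most_common(1)[0][0] if workday else None
    return home, work


# Synthetic pings with no name or phone number attached.
pings = [
    (datetime(2024, 1, 1, 23, 0), 40.7128, -74.0060),
    (datetime(2024, 1, 2, 2, 0), 40.7129, -74.0061),
    (datetime(2024, 1, 2, 10, 0), 40.7580, -73.9851),
    (datetime(2024, 1, 2, 14, 0), 40.7581, -73.9852),
    (datetime(2024, 1, 3, 1, 0), 40.7128, -74.0060),
]
home, work = infer_home_and_work(pings)
print("likely home cell:", home)
print("likely work cell:", work)
```

Once you have a home cell and a work cell, joining them against an address list or a public directory is usually all it takes to put a name on the "anonymous" record.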

De-anonymising would be detecting/creating a person's identity from sets of "anonymised" information, I'd assume.
D’oh. I meant anonymized (I would edit the comment but can’t).
It's really nothing new. Relatively low-level IT staff have been able to access the CEO's emails, mostly unchecked, for the past couple of decades now. Very few companies actually have working data protections of any sort, let alone audit trails. I've seen setups where virtually anyone with a computer had access to real, valid personal information, including birthdays, employment history, and raw SSNs, of hundreds of thousands of people. This is 100% the norm.

In the digital age, privacy is not really a thing, and for the most part, anyone semi-integrated in modern society should just expect that virtually all information about themselves, including things like shopping and television watching habits, conversational "metadata" that is not-quite-so-meta, and so forth is quietly being trafficked by all kinds of parties. Only a tiny portion of that traffic is actual identity theft, though each step along the way increases its likelihood.

Do not make the mistake of believing this is limited to online activity either. WikiLeaks has revealed that the government would remotely hijack a common brand of wall-mounted televisions and activate the internal microphone (???) to listen to private conversations, and this has long been speculated to be a use of cellular phones as well. Many people willingly stream everything that happens in their home, video and audio, to Google via the Nest Cam or similar devices.

If you've carried your cell phone on you, BigBroCo knows that you were (or weren't) at church on Sunday, Walmart on Tuesday, and that you spent a suspiciously long time parked in the far corner of a parking lot yesterday.

It's pretty likely that the relevant parties have already deduced which transaction in Walmart's payment systems was yours and have indexed the contents of your shopping run (even more likely if you've explicitly enabled this by using something like Walmart Pay in the Walmart app). Transactions are tied to your profile and it's all shopped around not only through AdWords but private B2B databases that want the information to try to target you with relevant solicitations. This is trivially visible in a variety of ways, since almost everyone is using such programs in some way or another.

It's naive to assume that such gold mines have never tempted anyone, especially when many of the people with an interest in accessing some of that data would also know how to bypass any auditing systems that may exist, and especially when there are many people who'd be happy to grease a few palms for the inside scoop.

We're entering a world where interest in the vast trove of personal data expands beyond targeted marketing and ultra-high-level political malfeasance. We have a thorny and potentially very scary road ahead of us.

He probably meant to say deidentified.