| HN Mirror



	by bobds 5232 days ago
	I don't think hashing by itself means much, especially when working with tightly constrained values. How about k-anonymity?


jbooth 5232 days ago

Can you explain how hashing is insufficient? You're saying someone could use a lookup table of hashcodes for all the URLs on the internet and deanonymize the URLs? The browser's identity is still unrecoverable.

EDIT: Ah, just looked up k-anonymity. We don't store any of the information it's meant to protect, like age, sex, or any other personal attributes.


bobds 5232 days ago

I'm sure you are looking at more than just URLs.

Say you have a lead form with two fields, email and zip code. You would store a variety of data points besides those two: referring URL, IP address, user agent, etc. If you just hash everything and I gain access to your hashed values, it would be easy to build a lookup table that reverses the one-way hash, at least for some of the data points.
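
A minimal sketch of that lookup-table attack, assuming the zip code is stored as an unsalted SHA-256 hex digest (the hash function, the `sha256_hex` and `build_zip_lookup` helpers, and the sample zip code are assumptions for illustration, not details from this thread). There are only 100,000 possible 5-digit values, so every one of them can be hashed in advance:

```python
import hashlib


def sha256_hex(value):
    """Return the unsalted SHA-256 hex digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_zip_lookup():
    """Map the hash of every possible 5-digit zip code back to the zip code."""
    return {sha256_hex(f"{n:05d}"): f"{n:05d}" for n in range(100_000)}


# The attacker precomputes the table once.
lookup = build_zip_lookup()

# A hashed value as it might sit in the leaked data (hypothetical sample).
stored_hash = sha256_hex("90210")

# Reversing the "one-way" hash is now a dictionary lookup.
recovered = lookup[stored_hash]
print(recovered)
```

The same approach works for any field with a small value space. It does not work as-is for high-entropy fields, which is why it only covers "some of the data points".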

I haven't had a chance to read the k-anonymity paper or the related papers, but from what I understand it's not specific to data points like age/sex/etc.


falcolas 5232 days ago

If you obtain my gender, DOB and zip code (which is not hard - Google demonstrated that it had all of that data even though I never gave it to them directly), you can uniquely identify me 80% of the time.

That's insufficient, in my book.


jbooth 5232 days ago

I'm a little skeptical of this claim in light of the birthday problem: http://en.wikipedia.org/wiki/Birthday_problem

23 people means a 50% chance that 2 of them share a birthday. Assuming that your gender and zip code put you in a pool of more than 23 people, I'm not sure how your statistic holds up. Do you have a citation?
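
For reference, a rough sketch of the arithmetic, assuming 365 equally likely birthdays and ignoring the year of birth (the function names are made up for illustration). It computes the birthday-problem figure quoted above alongside a different quantity, the chance that one specific person shares a birthday with nobody else in the pool:

```python
def p_any_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def p_person_is_unique(n, days=365):
    """Probability that one given person matches none of the other n - 1."""
    return ((days - 1) / days) ** (n - 1)


pool = 23
print(round(p_any_shared_birthday(pool), 3))
print(round(p_person_is_unique(pool), 3))
```

With a pool of 23 the first value comes out near 0.51 and the second near 0.94, so the two questions have quite different answers. A full date of birth also includes the year, which this 365-day model leaves out.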


hackerblues 5232 days ago

I believe the previous commenter was referring to the work conducted by Latanya Sweeney:

http://dataprivacylab.org/projects/identifiability/paper1.pd...

"In this document, I report on experiments I conducted using 1990 U.S. Census summary data to determine how many individuals within geographically situated populations had combinations of demographic values that occurred infrequently. It was found that combinations of few characteristics often combine in populations to uniquely or nearly uniquely identify some individuals. Clearly, data released containing such information about these individuals should not be considered anonymous. Yet, health and other person-specific data are publicly available in this form. Here are some surprising results using only three fields of information, even though typical data releases contain many more fields. It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides. And even at the county level, {county, gender, date of birth} are likely to uniquely identify 18% of the U.S. population. In general, few characteristics are needed to uniquely identify a person."

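The kind of count described in the quote can be sketched as follows, assuming each record is reduced to a {ZIP, gender, date of birth} tuple. The records below are invented toy data, not census figures, and `fraction_unique` and `k_anonymity` are hypothetical helper names:

```python
from collections import Counter

# Toy (zip, gender, date of birth) records, made up for illustration.
records = [
    ("02139", "F", "1961-07-04"),
    ("02139", "F", "1961-07-04"),
    ("02139", "M", "1975-01-15"),
    ("60614", "F", "1988-11-30"),
    ("60614", "M", "1988-11-30"),
]


def fraction_unique(rows):
    """Share of rows whose combination of values appears exactly once."""
    sizes = Counter(rows)
    unique = sum(1 for row in rows if sizes[row] == 1)
    return unique / len(rows)


def k_anonymity(rows):
    """Size of the smallest group of rows sharing the same values."""
    return min(Counter(rows).values())


print(fraction_unique(records))
print(k_anonymity(records))
```

In this toy set three of the five records are unique, and the smallest group has size 1, so the data is only 1-anonymous: at least one person can be singled out by these three fields alone.
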