by nappy-doo 3071 days ago
Depending on the data, Google anonymizes the data in 30-180 days (most things are ≤34 days, and 180 is reserved for search).

A friendly reminder that there's no such thing as "anonymized data", there's only "anonymized until combined with other data sets".
A textbook example is the AOL search history release [1]. They went to the trouble of wiping user account information but left anonymous (but unique) per-user numeric identifiers. Oops, someone didn't think that one through.

1: https://en.wikipedia.org/wiki/AOL_search_data_leak
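To make that failure mode concrete, here is a minimal Python sketch. The log rows, the IDs, and the `group_by_pseudonym` helper are all made up for illustration and are not taken from the actual release. It only shows that a stable per-user number keeps every query linkable to the same person, even after the account name is gone.

```python
from collections import defaultdict

# Hypothetical log rows: (pseudonymous_id, query). All values are invented.
log = [
    (1001, "plumbers in springfield"),
    (2002, "cheap flights"),
    (1001, "jane doe springfield"),
    (1001, "knee pain treatment"),
    (2002, "weather tomorrow"),
]

def group_by_pseudonym(rows):
    """Collect every query issued under the same pseudonymous ID."""
    profiles = defaultdict(list)
    for pid, query in rows:
        profiles[pid].append(query)
    return dict(profiles)

profiles = group_by_pseudonym(log)
for pid, queries in profiles.items():
    print(pid, queries)
```

Once the queries for ID 1001 sit side by side, a name, a town, and a medical query together may be enough to narrow things down to one person.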

How is that an example of one of the failures of anonymization of data? To me it just looks like AOL did a shitty job, not that the concept as a whole is a lost cause.
Most breaches happen because someone "did a shitty job".

The truth is, you're doing a shitty job if you don't recognize anonymization for what it is - essentially trying to have your cake and eat it too. In practice, it has specific constraints that must be met, and I'd judge the difficulty of doing a good job here to be similar to rolling your own crypto. That is, unless you're a good statistician, you're better off not sharing the data (or not having it in the first place) than releasing it "anonymized".

Disagree. You can anonymize and aggregate data such that any extra data capable of deanonymizing it would already let you fully reconstruct what you're looking for. Then the release isn't adding any information.

As an example, a list of Google searches, aggregated at the minute level, is a useful dataset, but it won't tell you anything about my search history unless you already have my search history, in which case you already knew the answer.

To take your example - assume I target you personally, and beyond the list of Google searches, I managed to get hold of the list of times (with minute-or-better precision) you made requests to Google Search (say, I hacked/subpoenaed your ISP). Taken together and if large enough, the two datasets would allow me to build a statistical profile of your possible interests - even though in the original dataset you're bucketed together with lots of people, each time you do the search (second dataset) you're bucketed with different people.

Gaining access to other data - e.g. your country of residence plus the aggregated popularity of search terms for each country - would let me refine your statistical profile further.
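A rough sketch of the timing attack described above, in Python. Everything here is an assumption for illustration: the `interest_scores` helper, the toy per-minute data, and the scoring rule (how much more often a term shows up in the target's active minutes than in all minutes). A real attack would need far more data and proper statistics.

```python
from collections import Counter

def interest_scores(searches_by_minute, target_minutes):
    """Score each term by how much more often it appears in minutes when the
    target was active than in minutes overall (a crude intersection attack)."""
    target = set(target_minutes) & set(searches_by_minute)
    if not target:
        return {}
    overall = Counter()
    during_target = Counter()
    for minute, terms in searches_by_minute.items():
        unique = set(terms)  # count each term at most once per minute
        overall.update(unique)
        if minute in target:
            during_target.update(unique)
    n_all = len(searches_by_minute)
    n_target = len(target)
    return {
        term: during_target[term] / n_target - overall[term] / n_all
        for term in overall
    }

# Dataset 1 (hypothetical): the "anonymized" release, searches bucketed per minute.
searches_by_minute = {
    0: ["cats", "news", "gout remedy"],
    1: ["cats", "weather"],
    2: ["news", "gout remedy", "football"],
    3: ["weather", "football"],
    4: ["gout remedy", "cats"],
    5: ["news", "weather"],
}
# Dataset 2 (hypothetical): minutes at which the target hit the search engine.
target_minutes = {0, 2, 4}

scores = interest_scores(searches_by_minute, target_minutes)
top_term = max(scores, key=scores.get)
print(top_term, round(scores[top_term], 2))
```

In this toy data the term that co-occurs with every one of the target's requests floats to the top, even though each bucket mixes several people. Whether that signal survives at real-world volumes is exactly the open question.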

That would potentially work, but I expect it would require a dataset covering a time span longer than the average human lifespan (there are about 2.4 million Google searches per minute).
And isn't Google's method of anonymization simply to "black out" the last octet of each user's IP? If so, in many cases one wouldn't even need much in the way of additional data.
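For a sense of how little last-octet masking hides, here is a small Python sketch using the standard `ipaddress` module. It assumes the scheme is exactly "zero the final octet of an IPv4 address", which is this thread's description and not something confirmed here. The helper names and the sample address (from a documentation-reserved range) are made up.

```python
import ipaddress

def mask_last_octet(ip: str) -> str:
    """Zero the final octet of an IPv4 address, keeping the /24 prefix."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)

def candidate_count(masked_ip: str) -> int:
    """Number of addresses that collapse to the same masked value."""
    return ipaddress.ip_network(f"{masked_ip}/24").num_addresses

masked = mask_last_octet("203.0.113.77")
candidates = candidate_count(masked)
print(masked, candidates)  # 203.0.113.0 256
```

Under that assumption the masked address still pins the user down to one of at most 256 addresses, so the ISP and rough location stay visible.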