by gaius 3071 days ago
a case today that forces Google to share all the data that it collects on searches would help break its monopoly.

The only thing worse than Google having everyone’s search history is forcing them to reveal everyone’s search history. The data should be destroyed and new data not gathered.

“You googled X on date D therefore we have concluded Y is a preexisting condition and your claim is denied”


Depending on the type of data, Google anonymizes it within 30-180 days (most things are ≤34 days, and 180 is reserved for search).
A friendly reminder that there's no such thing as "anonymized data", there's only "anonymized until combined with other data sets".
A textbook example is the AOL search history release [1]. They went to the trouble of wiping user account information but left anonymous (but unique) per-user numeric identifiers. Oops, someone didn't think that one through.

1: https://en.wikipedia.org/wiki/AOL_search_data_leak
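
To illustrate why a stable per-user identifier undermines this kind of release, here is a minimal sketch. The log rows, IDs, and the `group_by_pseudonym` helper are all made up for illustration and are not taken from the AOL data; the point is only that grouping by the "anonymous" ID reassembles each person's full query history, which can then be matched against outside knowledge.

```python
from collections import defaultdict

# Hypothetical log rows: (pseudonymous_id, query). Every value is invented.
log = [
    (1001, "plumber near elm street"),
    (2002, "cheap flights"),
    (1001, "knee pain after running"),
    (1001, "high school reunion class of 1984"),
    (2002, "weather tomorrow"),
]

def group_by_pseudonym(rows):
    """Collect every query issued under the same pseudonymous ID."""
    profiles = defaultdict(list)
    for uid, query in rows:
        profiles[uid].append(query)
    return dict(profiles)

profiles = group_by_pseudonym(log)
# profiles[1001] now holds a location hint, a health hint, and an age hint,
# all linked together even though no account name was ever released.
```

No single row identifies anyone; the linkage across rows is what does the damage.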

How is that an example of a failure of data anonymization? To me it just looks like AOL did a shitty job, not that the concept as a whole is a lost cause.
Most breaches happen because someone "did a shitty job".

The truth is, you're doing a shitty job if you don't recognize anonymization for what it is - essentially trying to have your cake and eat it too. In practice, it has specific constraints that must be met, and I'd judge the difficulty of doing a good job here to be similar to rolling your own crypto. That is, unless you're a good statistician, you're better off not sharing the data (or not having it in the first place) than releasing it "anonymized".

Disagree. You can anonymize and aggregate data such that any auxiliary data capable of deanonymizing it would already let you fully reconstruct what you're looking for. In that case the release isn't adding any information.

As an example, a list of Google searches, aggregated at the minute level, is a useful dataset, but it won't tell you anything about my search history unless you already have my search history, in which case you already knew the answer.

To take your example - assume I target you personally, and beyond the list of Google searches, I managed to get hold of the list of times (with minute-or-better precision) you made requests to Google Search (say, I hacked/subpoenaed your ISP). Taken together and if large enough, the two datasets would allow me to build a statistical profile of your possible interests - even though in the original dataset you're bucketed together with lots of people, each time you do the search (second dataset) you're bucketed with different people.

Gaining access to other data - like e.g. your country of residence + aggregated popularity of search terms for each country - would let me refine your statistical profile further.
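
A rough sketch of the attack described above, under loud assumptions: the per-minute aggregate, the target's request times, and the `lift_scores` helper are all hypothetical, and the scoring rule (how much more often a term shows up in the target's minutes than in minutes overall) is just one simple choice, not a claim about how a real attacker would do it.

```python
from collections import Counter

# Hypothetical aggregate: minute -> terms searched by anyone in that minute.
aggregated = {
    0: {"news", "flu symptoms", "weather"},
    1: {"news", "weather"},
    2: {"flu symptoms", "football"},
    3: {"news", "football"},
    4: {"flu symptoms", "news"},
    5: {"weather", "football"},
}

# Hypothetical second dataset: minutes in which the target contacted the
# search engine (e.g. obtained from their ISP).
target_minutes = [0, 2, 4]

def lift_scores(aggregated, target_minutes):
    """Score each term by (share of target minutes) - (share of all minutes)."""
    total = len(aggregated)
    overall = Counter(t for terms in aggregated.values() for t in terms)
    seen = Counter(t for m in target_minutes for t in aggregated.get(m, ()))
    n = len(target_minutes)
    return {t: seen[t] / n - overall[t] / total for t in overall}

scores = lift_scores(aggregated, target_minutes)
likely = max(scores, key=scores.get)
```

With this toy data, "flu symptoms" appears in every minute the target was active but only half of all minutes, so it stands out, while "news" is equally common in both and scores zero. In a real aggregate the crowd in each bucket is far larger, so many more observations would be needed before any term separates from the noise.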

That could work in principle, but I expect it would require a dataset spanning more than the average lifespan of a person (2.4 million Google searches per minute).
And isn't Google's method of anonymization simply to "black out" the last octet of each user's IP? In many cases one wouldn't even need much in the way of additional data.
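
For reference, a minimal sketch of what masking the last octet of an IPv4 address amounts to, assuming the scheme is as simple as the comment above suggests (that is the commenter's claim, not something confirmed here). `mask_last_octet` is a hypothetical helper; the address is from a documentation range.

```python
import ipaddress

def mask_last_octet(ip: str) -> str:
    """Zero the final octet of an IPv4 address by truncating it to its /24."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)

masked = mask_last_octet("203.0.113.77")

# Number of addresses that collapse onto the same masked value.
candidates = ipaddress.ip_network(f"{masked}/24").num_addresses
```

The masked value still pins a record to one of at most 256 addresses, which is why little extra data may be needed to narrow it down.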
Sounds like an incentive to regulate, then. That they presently have the data and it's private is worse than it being totally public. If it's acted on now, it'll just be by a better-connected actor against somebody unaware they can even be acted upon in those ways.
Privacy is an illusion, and measures designed to protect privacy end up protecting the ones that have exclusive access to the data.
Two responses.

1. Although I see your point that search queries could be dangerous, Google would never allow such use of search queries. Using a given person's search queries to market products to them is one thing, but using the queries against them is a different beast that would badly hurt the Google brand. And although one could imagine a court order being used to get a user's history, establishing a pre-existing condition is not enough of a benefit to go through the trouble of getting a court order for search history (and the order would likely not be granted, given the circumstantial nature of the evidence).

2. There are a hundred reasons to search any given query. If an insurer is using such an insubstantial basis to establish a pre-existing condition, the actual problem is that the insurer has too much market power and can be (for lack of a better word) an asshole without losing customers.

You appear to be replying to a comment without having read or understood any of the context?

The thread is discussing a case in which Google would be forced to release all of this data via court order, so your first bullet doesn't make sense as Google would have no choice in this theoretical scenario.

"2. There are a hundred reasons to search any given query. If an insurer is using such an insubstantial basis to establish a pre-existing condition, the actual problem is that the insurer has too much market power and can be (for lack of a better word) an asshole without losing customers."

So, in other words, acting like a typical US-based health insurer before the Affordable Care Act became law.