by throwawaygoog10 2135 days ago
You seem to be missing what differential privacy is. It's not about collection of data, it's about the _use_ of that data. It's no secret that Google has an incredible amount of logging data, but the ways we can use it are very limited. Folks seem to be under the impression that we can willy-nilly just go ahead and build products that harvest everything about you and connect the dots across organizations. That's so funny, because it'd make things so much easier sometimes. :P

Instead, we have very strict privacy rules and experts to review the designs for the use of this data. If I even want to train an ML model over real data, I have to have an approved privacy review that shows how I maintain privacy.

Where I use differential privacy algorithms in my line of work is to do ad-hoc analysis over suggestions placed in front of users. I have dimensions to aggregate across, but I want to ensure that no one bucket can deanonymize a user. k-anonymity used to be the thing (e.g. if a bucket has <50 people in it, that's too few), but even a large bucket can deanonymize users, which is where differential privacy comes in. I sincerely don't care who the users are, I just want to know how our features get used to try and save them more time.
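
To make the contrast concrete, here's a rough sketch of the two approaches on per-bucket counts. This is not Google's actual tooling: the function names, the k=50 threshold, the epsilon value, and the assumption that each user lands in at most one bucket (so a count has sensitivity 1) are all mine, for illustration. It uses the textbook Laplace mechanism, and it assumes the set of bucket names is fixed and public.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def k_anonymous_counts(counts, k=50):
    # k-anonymity style: drop any bucket with fewer than k users,
    # but release the surviving counts exactly.
    return {bucket: n for bucket, n in counts.items() if n >= k}

def dp_counts(counts, epsilon, rng=None):
    # Differential-privacy style: add Laplace noise to every count.
    # Assumption: each user contributes to at most one bucket, so adding
    # or removing one user changes a single count by at most 1.
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {bucket: n + laplace_noise(scale, rng) for bucket, n in counts.items()}

if __name__ == "__main__":
    usage = {"accepted": 1200, "dismissed": 340, "edited": 12}
    print(k_anonymous_counts(usage))       # the "edited" bucket is dropped
    print(dp_counts(usage, epsilon=0.5))   # all buckets kept, counts are noisy
```

The exact counts from the k-anonymity version can still leak something if you compare two releases that differ by one user; the noisy version is meant to hide that difference. Smaller epsilon means more noise and stronger protection.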

Do I have access to the underlying logs? Yes. Can I use that to make decisions? No. I can however use the anonymized data to make decisions, and even store that longer than the underlying data exists (most logs exist for <14d).

Differential privacy also makes it possible to train models like SmartCompose by ensuring that the tokens it trains over are diffuse enough to not point back to any one person.

> I'd personally bet that differential privacy techniques that actually give users notable information-theoretic anonymity are very rarely used by Google in general.

For existing things, sure. They did their best, but this is new, reified research. As those older features are retired, they're being replaced by ones which use differential privacy techniques.

1 comments

I appreciate the quality response. A lot of the focus here seems to be 'prevent other consumers from finding things out about our users', which is good and important. I usually think more about it from Google's perspective, which is that they have the data, and perhaps they're not using it for X right now, but they have the potential to, and that potential is what creates this significant power imbalance and centralization that I'm often concerned over.

Obviously Google employees cannot go around reading and using all of my personal communications for whatever they want to, but the mere fact that Google has all of them is, to me, too much power given to a single actor, even if they are generally not abusing this power.

With that said, differential privacy is still a great tech, so it's still great that they're open-sourcing and encouraging things like this. But I'll likely remain concerned about the centralization of the world's data at the same time.