Hacker News
by Paul-ish 3446 days ago
I'm curious what techniques they will use to anonymize the data. I would guess some sort of differential privacy technique.

I hope so, but I would be a little surprised if they used differential privacy (DP). I am more inclined to think they will take a 'traditional' approach, meaning one with weaker provable privacy guarantees.
It's totally reasonable to consider doing something like this with differential privacy. The techniques exist, but it would still be pretty brave of them. They are certainly aware of DP.

Example technique: if you treat each record as a (src,dst) pair of (lat,lon) pairs or somesuch, you can build a 4d grid and populate each cell with its trip count plus Laplace noise. Each trip lands in exactly one cell, so this provides eps-differential privacy when the noise has scale roughly 1/eps.
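To make the grid idea concrete, here is a minimal Python sketch. The names (`laplace`, `cell_of`, `noisy_grid`), the bounding box, and the number of bins per axis are all made up for illustration, and this is not a claim about what Uber does. Note that noise goes into every cell, including empty ones, since publishing only the occupied cells would leak which cells have trips.

```python
import random
from collections import Counter
from itertools import product

def laplace(scale, rng):
    # The difference of two iid exponentials with rate 1/scale is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def cell_of(trip, bounds, bins):
    # trip = (src_lat, src_lon, dst_lat, dst_lon); bounds = one (lo, hi) per coordinate.
    idx = []
    for x, (lo, hi) in zip(trip, bounds):
        i = int((x - lo) / (hi - lo) * bins)
        idx.append(min(max(i, 0), bins - 1))  # clamp out-of-range values to the edge cells
    return tuple(idx)

def noisy_grid(trips, bounds, bins, eps, rng):
    # Each trip falls in exactly one cell, so adding or removing one trip
    # changes one count by 1; Laplace noise of scale 1/eps covers that.
    true_counts = Counter(cell_of(t, bounds, bins) for t in trips)
    return {
        cell: true_counts.get(cell, 0) + laplace(1.0 / eps, rng)
        for cell in product(range(bins), repeat=4)  # every cell, empty or not
    }
```

The grid has bins**4 cells, so in practice the bin count per axis has to stay small or most cells will be pure noise.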

Whenever a cell's noisy count is sufficiently large, you can subdivide that cell into finer cells and ask again. If you do the refinement at most k times, you get k*eps-differential privacy. There are smarter ways that work even better.
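A rough sketch of the refinement, assuming each cell is split in half along all four axes (16 children) and that "sufficiently large" means the noisy count exceeds some fixed `threshold`. Both choices, and the function names, are assumptions for illustration. Every trip is counted in at most k levels, each with Laplace noise of scale 1/eps, which is where the k*eps comes from.

```python
import random
from itertools import product

def laplace(scale, rng):
    # The difference of two iid exponentials with rate 1/scale is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def refine(trips, box, eps, k, threshold, rng):
    """Return a list of (child_box, noisy_count) leaves.

    box is a tuple of four (lo, hi) ranges; cells are half-open [lo, hi).
    k is the maximum number of levels of noisy counts to spend.
    """
    if k == 0:
        return []
    mids = [(lo + hi) / 2 for lo, hi in box]
    leaves = []
    for halves in product((0, 1), repeat=4):
        child = tuple(
            (lo, m) if h == 0 else (m, hi)
            for h, (lo, hi), m in zip(halves, box, mids)
        )
        inside = [
            t for t in trips
            if all(lo <= x < hi for x, (lo, hi) in zip(t, child))
        ]
        # Noise is added to every child, including empty ones.
        noisy = len(inside) + laplace(1.0 / eps, rng)
        if noisy > threshold and k > 1:
            # Decide using only the noisy count, then ask again at finer resolution.
            leaves.extend(refine(inside, child, eps, k - 1, threshold, rng))
        else:
            leaves.append((child, noisy))
    return leaves
```

The decision to subdivide depends only on the noisy count, never the true one, so the branching itself does not leak anything beyond what the counts already cost.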

These provide "trip privacy", meaning they mask the presence/absence of individual trips. Uber presumably has user identifiers, and could group all trips by one user together, and do the same counting where the weight of each trip is scaled down so that a user's weights sum to at most one. This would then give "user privacy", meaning it masks the presence/absence of individual users.
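A sketch of the user-level version, assuming records arrive as hypothetical (user_id, cell) pairs and each of a user's n trips gets weight 1/n (one simple way to make the weights sum to at most one). Adding or removing a whole user then moves the weighted counts by at most 1 in total, so the same Laplace scale of 1/eps applies.

```python
import random
from collections import Counter, defaultdict

def laplace(scale, rng):
    # The difference of two iid exponentials with rate 1/scale is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def user_weighted_counts(records, cells, eps, rng):
    # records: iterable of (user_id, cell); cells: every cell to publish.
    records = list(records)
    trips_per_user = Counter(user for user, _ in records)
    weighted = defaultdict(float)
    for user, cell in records:
        # Each trip gets weight 1/n, so one user's total weight is exactly 1.
        weighted[cell] += 1.0 / trips_per_user[user]
    # Noise goes on every published cell, including ones no user touched.
    return {c: weighted.get(c, 0.0) + laplace(1.0 / eps, rng) for c in cells}
```

The trade-off is that heavy users get diluted: someone with 100 trips contributes 0.01 per trip, so the published counts are closer to "number of users" than "number of trips".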