by antpls 2484 days ago
After reading all your links, I'm still not sure why or where Differential Privacy is needed.

1) How could aggregated data (mean, min, max) be used by attackers? Isn't aggregated data already private? For example, the Google postgres extension returns aggregated data, so why is DP required here?

2) In the case of sharing entire databases, if all the PII is removed, why does it matter that we can match two records from two databases? Yes, we can correlate records between 2 databases, but if PII was never gathered or stored in any database, there would be no privacy issue in the first place.


Good questions =)

1) Note that the "min/max" example trivially leaks individual information: for example, releasing the max salary of employees of a company leaks the salary of the CEO. More generally, there have been numerous attacks on privacy notions purely based on aggregate data. One of my favorites is this one: https://blog.acolyer.org/2017/05/15/trajectory-recovery-from...
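To make the aggregate point concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the names, the salaries, the `cap` bound and the `epsilon` value are invented, and it is not taken from the linked paper or from any particular DP library. It shows the max leaking one person's value, two exact sums that differ by one person revealing that person's salary (a differencing attack), and a Laplace-noised sum as one common way to blunt that.

```python
import random

# Hypothetical data: names and salaries are invented for illustration.
salaries = {"alice": 52_000, "bob": 61_000, "carol": 58_000, "ceo": 900_000}

def total(names):
    """Exact sum of salaries for the given people."""
    return sum(salaries[n] for n in names)

# 1) The max is an aggregate, yet it is exactly one person's value.
max_salary = max(salaries.values())

# 2) Differencing attack: two exact aggregates that differ by one person.
everyone = list(salaries)
everyone_but_bob = [n for n in everyone if n != "bob"]
bob_salary = total(everyone) - total(everyone_but_bob)

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_total(names, epsilon, cap, rng):
    """Sum with Laplace noise. Salaries are clamped to [0, cap] so that
    adding or removing one person changes the sum by at most cap."""
    clamped = sum(min(max(salaries[n], 0), cap) for n in names)
    return clamped + laplace_noise(cap / epsilon, rng)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
cap, epsilon = 1_000_000, 0.5  # assumed values, chosen arbitrarily
noisy_diff = (noisy_total(everyone, epsilon, cap, rng)
              - noisy_total(everyone_but_bob, epsilon, cap, rng))

print(max_salary, bob_salary)  # exact aggregates leak individual values
print(round(noisy_diff))       # the noisy difference no longer pins down Bob
```

With only four people the noise swamps the answer, which is the usual trade-off: the same noise scale becomes relatively small when the sum covers many more records.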

2) Typically, PII is not the only thing that can be used to reidentify someone, and matching records from different databases can sometimes infer sensitive information about people. One example: https://www.cs.cornell.edu/~shmat/netflix-faq.html
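A rough sketch of how such record matching can work, in Python. The data, the pseudonymous IDs and the `best_match` helper are all hypothetical, and the scoring (exact overlap counting) is a deliberate simplification, not the algorithm from the linked study. The point is only that a handful of (movie, date) pairs with no PII in them can act as a fingerprint.

```python
# "Anonymized" release: pseudonymous IDs, no names, no PII columns.
# All records below are invented for illustration.
anonymized = {
    "user_17": {("Movie A", "2005-03-01"), ("Movie B", "2005-03-04"),
                ("Movie C", "2005-04-10"), ("Movie D", "2005-05-02")},
    "user_42": {("Movie A", "2005-06-11"), ("Movie E", "2005-03-04"),
                ("Movie F", "2005-07-19")},
    "user_99": {("Movie B", "2005-01-20"), ("Movie C", "2005-02-14")},
}

# Public side: a few ratings someone posted under their real name.
public_profiles = {
    "Jane Doe": {("Movie A", "2005-03-01"), ("Movie B", "2005-03-04")},
}

def best_match(public_ratings, release):
    """Return the pseudonymous ID sharing the most (movie, date) pairs
    with the public ratings, plus the overlap count for every ID."""
    scores = {uid: len(public_ratings & records)
              for uid, records in release.items()}
    return max(scores, key=scores.get), scores

match, scores = best_match(public_profiles["Jane Doe"], anonymized)
print(match, scores)

# Once linked, everything else in the matched record is attributable
# to the named person, including ratings they never made public.
revealed = anonymized[match] - public_profiles["Jane Doe"]
print(sorted(revealed))
```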

If you haven't already...

Applying differential privacy to that Netflix case study would be a terrific exercise.

I'm still not convinced, but I guess I'm lacking critical technical background to grasp it.

1) The CEO example isn't really a good one to me, given the wealth inequalities in the world, leaking CEO's salary is almost desirable... I tried to read the blogpost and paper about mobile data location. At one point they talk about aggregated data, but then in the paper: "This dataset is collected by a major mobile network operator in China. It is a large-scale dataset including 100,000 mobile users with the duration of one week, between April 1st and 7th, 2016. It records the spatiotemporal information of mobile subscribers when they access cellular network (i.e., making phone calls, sending texts, or consuming data plan). It also contains anonymous user identification, accessed base stations and timestamp of each access." So... the data is not really "aggregated"? The dataset literally lists some user IDs.

2) If I'm fired because my boss didn't like my movie history, then it can probably be defended in court, depending on the country. I could also find another boss who has a natural sense of ethics and who doesn't judge me for what I watch.

Thank you for the links anyway. I will look at them again in a few days to see if I missed something.

"The CEO example isn't really a good one to me, given the wealth inequalities in the world, leaking CEO's salary is almost desirable... "

Your value judgement about a potential attack vector doesn't change the fact that it is an attack vector.