by TedTed 2484 days ago
Hi, I'm one of the authors of the scientific paper¹ linked in this blog post. Incidentally, I wrote a series of blog posts explaining differential privacy in layman's terms. The first post might be "not-technical-enough" for Hacker News, but maybe the later ones in the series make up for it. Feedback welcome =)

- https://desfontain.es/privacy/differential-privacy-awesomene...

- https://desfontain.es/privacy/differential-privacy-in-more-d...

- https://desfontain.es/privacy/differential-privacy-in-practi...

- https://desfontain.es/privacy/almost-differential-privacy.ht... (describes a core intuition behind the system described in our paper)

- https://desfontain.es/privacy/local-global-differential-priv...

I also think Section 2 of the paper should be readable by most folks with a basic understanding of SQL and differential privacy.

¹ https://arxiv.org/abs/1909.01917

7 comments

Thank you. I'll try to digest these. I barely understand the math involved, so I'm pretty sure my "intuition" is wide of the mark. I'm very grateful for your efforts to explain and socialize your work.

When I studied cryptographic voting systems, my "aha" moment was realizing the magic sauce is creating hash collisions so that a secure one-way hash can be used to protect voter privacy.

Re-re-reading the differential privacy stuff, this para jumped out:

https://en.wikipedia.org/wiki/Differential_privacy#ε-differe...

"The intuition for the 2006 definition of ε-differential privacy is that a person's privacy cannot be compromised by a statistical release if their data are not in the database. Therefore with differential privacy, the goal is to give each individual roughly the same privacy that would result from having their data removed."

Oh. The "differential" part means modeling the difference between data captured and not captured.

I think (hope) this means figuring out how much to fuzz the captured data so that hash collisions will match real world fuzzing.
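
To make the quoted intuition concrete, here is a minimal Python sketch of the Laplace mechanism applied to a counting query. It is not taken from the paper or the blog posts: the function names `noisy_count` and `laplace_density_ratio` and the toy survey data are assumptions for illustration. What it shows is the standard ε-DP guarantee: for two databases that differ in one person, the likelihood of any given output changes by at most a factor of e^ε. No hashing is involved here; the protection comes from calibrated random noise.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Count matching records, then add Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 (assuming each person contributes one record).
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_density_ratio(x, count_with, count_without, epsilon):
    """Ratio of the output densities at x for two neighboring databases."""
    scale = 1.0 / epsilon
    return math.exp((abs(x - count_without) - abs(x - count_with)) / scale)


# Hypothetical survey answers; the last record is "Alice".
with_alice = ["yes", "no", "yes", "yes"]
without_alice = with_alice[:-1]
is_yes = lambda r: r == "yes"

rng = random.Random(0)
print(noisy_count(with_alice, is_yes, epsilon=0.5, rng=rng))
print(noisy_count(without_alice, is_yes, epsilon=0.5, rng=rng))
```

With ε = 0.5, `laplace_density_ratio(x, 3, 2, 0.5)` stays between e^-0.5 and e^0.5 for every `x`, which is the "roughly the same privacy as if their data were removed" statement in numeric form.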

--

Again, I'll continue to try to grok this stuff. Real world (story book style) examples will be very helpful.

Until I do understand, I think it's crucial for crypto- and privacy-minded people to quantify the assumptions and context involved. When I was working on election integrity (and medical records & guarding patient privacy), all the discussions were just make believe. I did help author a govt report meant to help quantify the attack surface area for election administration. But I don't think it did much good, nor was it replicable (to new contexts).

So bravo. Please keep going.

After reading all your links, I'm still not sure why or where Differential Privacy is needed.

1) How could aggregated data (means, averages, min/max) be used by attackers? Isn't aggregated data already private? For example, the Google postgres extension returns aggregated data, so why is DP required here?

2) In the case of sharing entire databases, if all the PII is removed, why does it matter that we can match two records from two databases? Yes, we can do correlation between 2 databases, but if PII were not gathered and stored at all in any database, there would be no privacy issue in the first place.

Good questions =)

1) Note that the "min/max" example trivially leaks individual information: for example, releasing the max salary of employees of a company leaks the salary of the CEO. More generally, there have been numerous attacks on privacy notions purely based on aggregate data. One of my favorites is this one: https://blog.acolyer.org/2017/05/15/trajectory-recovery-from...

2) Typically, PII is not the only thing that can be used to reidentify someone, and matching records from different databases can sometimes infer sensitive information about people. One example: https://www.cs.cornell.edu/~shmat/netflix-faq.html
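
Both points can be sketched in a few lines of Python. This is a toy illustration, not a reconstruction of the linked attacks: the salary figures, table contents, column names, and function names below are all made up. The first two functions show how exact aggregates can reveal one person's value; the third shows how a table with no names can be joined to a public one on quasi-identifiers.

```python
def max_leaks_top_earner(salaries):
    """The max of a column is exactly one person's value."""
    return max(salaries.values())


def differencing_attack(salaries, target):
    """Recover one salary from two sums that each look harmless."""
    total_all = sum(salaries.values())
    total_without = sum(v for name, v in salaries.items() if name != target)
    return total_all - total_without


def link_records(deidentified, public, keys):
    """Join a 'PII-free' table to a public one on shared quasi-identifiers."""
    matches = []
    for row in deidentified:
        fingerprint = tuple(row[k] for k in keys)
        candidates = [p for p in public
                      if tuple(p[k] for k in keys) == fingerprint]
        if len(candidates) == 1:  # a unique match re-identifies the row
            matches.append((candidates[0]["name"], row))
    return matches


# Hypothetical data for illustration only.
salaries = {"ceo": 900_000, "ana": 70_000, "bo": 65_000}
deidentified = [
    {"zip": "98101", "birth_year": 1980, "diagnosis": "flu"},
    {"zip": "98102", "birth_year": 1975, "diagnosis": "asthma"},
]
public = [
    {"name": "Ana", "zip": "98101", "birth_year": 1980},
    {"name": "Bo", "zip": "98103", "birth_year": 1990},
]

print(max_leaks_top_earner(salaries))
print(differencing_attack(salaries, "ana"))
print(link_records(deidentified, public, ("zip", "birth_year")))
```

In the differencing case, neither sum names anyone, yet subtracting them yields one person's exact salary. Noise added to each released aggregate is what blunts this kind of subtraction.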

If you haven't already...

Applying differential privacy to that Netflix case study would be a terrific exercise.

I'm still not convinced, but I guess I'm lacking critical technical background to grasp it.

1) The CEO example isn't really a good one to me, given the wealth inequalities in the world, leaking CEO's salary is almost desirable... I tried to read the blogpost and paper about mobile data location. At one point they talk about aggregated data, but then in the paper: "This dataset is collected by a major mobile network operator in China. It is a large-scale dataset including 100,000 mobile users with the duration of one week, between April 1st and 7th, 2016. It records the spatiotemporal information of mobile subscribers when they access cellular network (i.e., making phone calls, sending texts, or consuming data plan). It also contains anonymous user identification, accessed base stations and timestamp of each access." So... the data is not really "aggregated"? The dataset literally lists some user IDs.

2) If I'm fired because my boss didn't like my movie history, then it can probably be defended in court, depending on the country. I could also find another boss who has a natural sense of ethics and who doesn't judge me for what I watch.

Thank you for the links anyway. I will look at them again in a few days to see if I missed something.

"The CEO example isn't really a good one to me, given the wealth inequalities in the world, leaking CEO's salary is almost desirable... "

Your value judgement about a potential attack vector doesn't change the fact that it is an attack vector.

After page 20 of your paper are pages of junk that I don't expect you mean to have there (at least in the PDF version).

Yeah, thanks, I messed up the upload to arXiv >< It'll be fixed tonight with the updated version.

Haven't read the paper yet, but have read the blog posts (which are awesome, BTW!).

I'm wondering if you have any thoughts on Frank McSherry's old blog post expressing his distrust for approximate-DP [1]. He seems to have different intuitions than your "almost DP" post expresses and makes criticisms that aren't quite addressed in your post.

[1]: https://github.com/frankmcsherry/blog/blob/master/posts/2017...

Based on a skim, it seems that this requires individuals (or whatever class of entity we don't want to leak info about) to be expressly identified in each record of the queried dataset, e.g. with a uid column. So for a dataset that one wanted to use this method with but that doesn't identify individuals (e.g. in the web log case, visitors who are not logged in, or simply not recording the logged-in user), one would have to heuristically assign a (potentially synthetic) identity to each record first. Is that right?
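
One way to see why a per-record owner ID matters is the following Python sketch. It is an assumption-laden illustration, not the paper's implementation: `synthetic_uid` is a hypothetical heuristic (hashing IP address plus user agent), and `clamped_count` and `bounded_noisy_count` are made-up names. The idea shown is that noise can only be calibrated if each individual's contribution to the result is bounded, and bounding it requires knowing which rows belong to the same individual.

```python
import hashlib
import random
from collections import defaultdict


def synthetic_uid(row):
    """Hypothetical heuristic identity for log rows that have no user ID."""
    raw = f"{row['ip']}|{row['user_agent']}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]


def clamped_count(rows, uid_of, max_rows_per_user):
    """Count rows, letting each uid contribute at most max_rows_per_user."""
    per_user = defaultdict(int)
    for row in rows:
        per_user[uid_of(row)] += 1
    return sum(min(n, max_rows_per_user) for n in per_user.values())


def bounded_noisy_count(rows, uid_of, max_rows_per_user, epsilon, rng):
    """Clamped count plus Laplace noise of scale max_rows_per_user / epsilon.

    Removing one uid changes the clamped count by at most
    max_rows_per_user, so that value is the sensitivity.
    """
    scale = max_rows_per_user / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return clamped_count(rows, uid_of, max_rows_per_user) + noise


# Hypothetical web log: one heavy visitor and one light visitor.
log = [{"ip": "10.0.0.1", "user_agent": "x"}] * 5 \
    + [{"ip": "10.0.0.2", "user_agent": "x"}]

rng = random.Random(0)
print(clamped_count(log, synthetic_uid, max_rows_per_user=2))
print(bounded_noisy_count(log, synthetic_uid, 2, epsilon=1.0, rng=rng))
```

If the heuristic splits one real person across several synthetic IDs, the cap no longer bounds that person's contribution, so the guarantee only holds for whatever entity the ID actually tracks.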

This was very insightful, thank you.

^ commenter is a security researcher who works for Google.

This whole area of research seems like it exists as a way to rationalize wide-scale data collection. Rather than focusing on the collective rights of all people being tracked, it focuses on the risk that any individual person faces from an attacker.

Apple uses differential privacy and evangelizes its use. DP is not either-or with consent.

Let's say your mapping app says "Do you want to contribute traffic information during your drive to help provide a better navigation experience for everyone?" If you click "yes" and opt in, do you, or do you not, want this to use Differential Privacy?