We don't publish research papers, but here's a good article from another privacytech startup (not ours) that discusses some of the shortcomings of differential privacy:

https://medium.com/@francis_49362/dear-differential-privacy-...

Here's my simple take: Imagine you want to protect individuals by applying differential privacy when collecting their data, and that you publish datapoints that each contain only 1 bit of information (i.e. each datapoint says "this individual is a member of this group"). To protect the individual, you introduce strong randomization: in 90 % of cases you return a random value (0 or 1 with 50 % probability each), and only in 10 % of cases you return the true value. This is differentially private, and for a single datapoint it protects the individual very well, because they have very good plausible deniability. If you want a physical analogy, think of it as a 5 Volt signal on top of a 95 Volt noise background: for a single individual, no meaningful information can be extracted from such data, but if you combine the data of many individuals you can average out the noise and gain some real information.

However, averaging out the noise also works if you can combine multiple datapoints from the same individual, provided those datapoints describe the same information or are strongly correlated. An adversary who knows the values of some of the datapoints as context information can therefore infer whether an individual is in the dataset (which might already be a breach of privacy). If the adversary also knows which datapoints represent the same information or are correlated, they can infer the value of some attributes of the individual (e.g. learn whether the individual is part of a given group). How many datapoints an adversary needs for such an attack depends on the nature of the data.
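To make the one-bit randomization described above concrete, here is a minimal Python sketch. The function names (`randomize_bit`, `prob_report_one`, `epsilon`, `estimate_group_share`), the 30 % group share and the population size are all made up for illustration; this is not the code from the notebook linked below. It shows that a single report says very little about an individual, while the share of the group across many individuals can still be recovered.

```python
import math
import random


def randomize_bit(true_bit, p_truth, rng):
    """Return the true bit with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return true_bit
    return rng.randint(0, 1)


def prob_report_one(true_bit, p_truth):
    """Probability that the published bit is 1, given the true bit."""
    return p_truth * true_bit + (1 - p_truth) * 0.5


def epsilon(p_truth):
    """Privacy parameter of a single release: log of the worst-case
    ratio between the output probabilities for true bit 1 and 0."""
    return math.log(prob_report_one(1, p_truth) / prob_report_one(0, p_truth))


def estimate_group_share(reports, p_truth):
    """Debias the mean of many randomized bits (one per individual)."""
    mean = sum(reports) / len(reports)
    return (mean - 0.5 * (1 - p_truth)) / p_truth


# Illustrative (assumed) setup: 100,000 individuals, 30 % are in the group.
rng = random.Random(42)
true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(100_000)]
reports = [randomize_bit(b, 0.1, rng) for b in true_bits]

print("P(publish 1 | true 1):", prob_report_one(1, 0.1))
print("epsilon per release:  ", epsilon(0.1))
print("estimated group share:", estimate_group_share(reports, 0.1))
```

With a 10 % chance of returning the true value, epsilon per release comes out at about 0.2, which is why one datapoint on its own gives strong deniability.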

Example: Let's assume you randomize a bit by publishing the real value in only 10 % of cases and a random (50/50) value in the other cases. If the true value of the bit is 1, the probability of publishing a 1 is 55 %. This is a small difference, but if you publish this value 100 times (say you publish the data once per day for each individual), the standard deviation of the average of the randomized bits is just under 5 %, i.e. about the same size as the 5 percentage point shift away from 50 %. An adversary who observes the individual randomized bits can therefore already infer the true value of the bit with fairly high probability (roughly 84 % under a normal approximation). You can defend against this by increasing the randomization (with 99 % randomization, about 10,000 bits are needed before the standard deviation shrinks to the 0.5 percentage point shift), but this of course reduces the utility of the data for you as well. You can also use techniques like "sticky noise" (i.e. always produce the same noise value for a given individual and bit); in that case, though, the anonymity depends on the secrecy of the seed information used to generate that noise. Or you can try to avoid publishing the same information multiple times. This can be surprisingly difficult, as individual bits tend to be highly correlated in many analytics use cases (e.g. due to repeating patterns or behaviors).
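The arithmetic in this example can be checked with a short Python sketch. Everything here is an assumption for illustration: the function names, the simple majority-vote adversary, and the premise that every published bit is an independent randomization of the same true bit (the worst case described above).

```python
import math
import random


def prob_report_one(true_bit, p_truth):
    """Probability that the published bit is 1, given the true bit."""
    return p_truth * true_bit + (1 - p_truth) * 0.5


def std_of_mean(p_truth, n):
    """Standard deviation of the average of n independent randomized bits."""
    p1 = prob_report_one(1, p_truth)
    return math.sqrt(p1 * (1 - p1) / n)


def observations_needed(p_truth):
    """Number of bits at which the standard deviation of the average
    equals the shift of the publishing probability away from 50 %."""
    p1 = prob_report_one(1, p_truth)
    shift = p1 - 0.5
    return p1 * (1 - p1) / shift ** 2


def majority_attack_accuracy(p_truth, n, trials=2000, seed=0):
    """Simulate an adversary who guesses the true bit by majority vote
    over n published bits (ties count as half a correct guess)."""
    rng = random.Random(seed)
    correct = 0.0
    for _ in range(trials):
        true_bit = rng.randint(0, 1)
        p1 = prob_report_one(true_bit, p_truth)
        ones = sum(rng.random() < p1 for _ in range(n))
        if 2 * ones == n:
            correct += 0.5
        else:
            guess = 1 if 2 * ones > n else 0
            correct += guess == true_bit
    return correct / trials


print("std of average, 100 bits, 10 % truth:", std_of_mean(0.1, 100))
print("bits needed at 10 % truth:", round(observations_needed(0.1)))
print("bits needed at  1 % truth:", round(observations_needed(0.01)))
print("attack accuracy, 1 bit:   ", majority_attack_accuracy(0.1, 1))
print("attack accuracy, 100 bits:", majority_attack_accuracy(0.1, 100))
```

The simulated accuracy depends on the seed and the number of trials, so treat it as a rough indication rather than an exact figure. If the published bits are only correlated rather than identical, the adversary needs more of them.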

That said, differential privacy & randomization are still much more secure than naive anonymization techniques such as pure aggregation using k-anonymity.

We have a simple Jupyter notebook that shows how randomization works for the one-bit example btw:

https://github.com/KIProtect/data-privacy-for-data-scientist...