by pinneycolton
2745 days ago
I work with this type of data, and I assure you that the results are quite plausible. The original hypothesis was tested against US census data. See "Experiment B" here: https://dataprivacylab.org/projects/identifiability/paper1.p... I'll add that there are far fewer live births per day in the US than there are zip codes. I agree that some highly populated areas are problematic, but this may be the only reason that 87.1% number isn't 100%!
That said, looking at the paper you linked now, I don't see how Cook's simulation (simulation, not explanation) and Sweeney's paper can both be correct. Cook got 84-85% identifiable assuming uniform age distribution and identical population per zipcode. At the bottom of Figure 14, Sweeney says 87% for the US as a whole.
Shouldn't any non-uniformity (of zip code population or age clustering) act only to reduce the percent of the population that is identifiable? That is, shouldn't Cook's simulation with flat age distribution and equal zipcode populations be an upper bound on identifiability? Since Cook's simulation code looks fine, this makes me suspect that there's something off about Sweeney's analysis.
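To make the upper-bound question concrete, here is a rough sketch of the uniform case. The parameters (78 years of birthdates, 7,500 people per zip code) are my own illustrative assumptions, not the figures Cook or Sweeney actually used, and the function names are made up for this example. It compares a closed-form estimate against a small simulation, then checks one skewed split of the same population.

```python
import random
from collections import Counter

# Illustrative assumptions only; not Cook's or Sweeney's actual parameters.
DAYS_PER_YEAR = 365
AGE_YEARS = 78
SEXES = 2
CELLS = DAYS_PER_YEAR * AGE_YEARS * SEXES  # possible (birthdate, sex) combinations


def expected_unique_fraction(n, k=CELLS):
    """Chance that one person shares a cell with none of the other n - 1
    people when everyone lands in one of k equally likely cells."""
    return (1 - 1 / k) ** (n - 1)


def simulated_unique_fraction(n, k=CELLS, seed=0):
    """Drop n people into k equally likely cells and return the fraction
    of people who are alone in their cell."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(k) for _ in range(n))
    return sum(1 for c in counts.values() if c == 1) / n


def overall_unique_fraction(zip_populations, k=CELLS):
    """Population-weighted expected unique fraction across several zip codes."""
    total = sum(zip_populations)
    unique = sum(n * expected_unique_fraction(n, k) for n in zip_populations)
    return unique / total


equal = overall_unique_fraction([7500, 7500])
skewed = overall_unique_fraction([1000, 14000])
print("closed form:", expected_unique_fraction(7500))
print("simulated:  ", simulated_unique_fraction(7500))
print("equal zips: ", equal)
print("skewed zips:", skewed)
```

For these particular numbers the skewed split comes out lower than the equal one, which fits the upper-bound intuition. It is only one example under a uniform-birthdate assumption, so it doesn't settle the general case.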
Is the 87% perhaps an average of the state percentages, and not properly weighted by state population? Or maybe an average across age classes not weighted by population of that class? Oh, I don't know about those, but maybe I see a bigger issue now...
In Section 4.3.1, Sweeney defines the "Number of subjects uniquely identified in a subdivision of a geographical area". But this isn't a simulation like Cook's; she's just using a binary yes/no depending on whether the subpopulation in each age class exceeds a numerical threshold:
While it's nice that it's clearly defined, I don't think this yields a "percent identifiable" that matches up with Cook's simulation, nor with any common usage of the term. Also (while I'm being picky) isn't the definition backward? If we were to go with this arbitrary definition, wouldn't we want Id_zi to equal zero when the population is less than the threshold? I presume the direction of the inequality is just a typo, but if the paper is using a hard threshold rather than some more rigorous approach like Cook's simulation, this seems like a major flaw in the interpretability of the results.
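To show how far a hard threshold can drift from a simulation-style estimate, here is a toy comparison. The threshold rule below is a hypothetical stand-in, not the paper's exact definition: it takes the pigeonhole-style reading, where everyone in a cell counts as identified when the cell's population is at or below the number of possible values. That direction is an assumption for illustration only. K = 365 * 2 (birthdays times sexes for a single year of age) is also my assumption.

```python
def threshold_identified(n, k):
    """Hypothetical hard rule (an assumption, not the paper's formula):
    all n people count as identified if n <= k, otherwise none do."""
    return n if n <= k else 0


def expected_identified(n, k):
    """Expected number of people alone in their cell when n people are
    spread uniformly at random over k equally likely cells."""
    return n * (1 - 1 / k) ** (n - 1)


K = 365 * 2  # assumed: days per year times two sexes, one year-of-age class

for n in (100, 730, 731, 2000):
    print(n, threshold_identified(n, K), round(expected_identified(n, K), 1))
```

Under these assumptions the hard rule jumps from "everyone" to "no one" as n crosses K, while the expected count changes smoothly. Near n = K the expected unique share is only a bit over a third. That gap is the interpretability problem I mean.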