It may have gotten lost in the revisions of my comment, but I was consciously not trying to argue that Sweeney's numbers were wrong, only that Cook's explanation was lacking since it doesn't discuss distribution. I hadn't yet looked at the paper.

That said, looking now at the paper you linked, I don't see how Cook's simulation (simulation, not explanation) and Sweeney's paper can both be correct. Cook got 84-85% identifiable assuming a uniform age distribution and an identical population per zip code. At the bottom of Figure 14, Sweeney gives 87% for the US as a whole.

Shouldn't any non-uniformity (of zip code population or age clustering) act only to reduce the percent of the population that is identifiable? That is, shouldn't Cook's simulation with a flat age distribution and equal zip code populations be an upper bound on identifiability? Since Cook's simulation code looks fine, this makes me suspect that there's something off about Sweeney's analysis.
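For concreteness, here is a minimal sketch of the kind of simulation being discussed: one zip code, with every resident assigned a (sex, birth date) combination uniformly at random. It is not Cook's code. The values of `n_bins` and `n_people` are illustrative assumptions of mine, and `fraction_unique` and `expected_fraction_unique` are hypothetical helper names.

```python
import random
from collections import Counter

def fraction_unique(n_people, n_bins, rng):
    """Simulate one zip code and return the share of residents who are
    alone in their (sex, birth date) bin."""
    counts = Counter(rng.randrange(n_bins) for _ in range(n_people))
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / n_people

def expected_fraction_unique(n_people, n_bins):
    """Closed form under the same uniform assumption: a given person is
    unique when none of the other n_people - 1 residents share the bin."""
    return (1 - 1 / n_bins) ** (n_people - 1)

rng = random.Random(0)       # fixed seed so reruns are repeatable
n_bins = 2 * 365 * 80        # assumption: 2 sexes x 365 days x 80 birth years
n_people = 10_000            # assumption: residents in one zip code
trials = 50

simulated = sum(fraction_unique(n_people, n_bins, rng) for _ in range(trials)) / trials
analytic = expected_fraction_unique(n_people, n_bins)
print(f"simulated: {simulated:.3f}  analytic: {analytic:.3f}")
```

Under the uniform assumption the simulated share should track the closed form closely, so the interesting part is what happens once the zip code populations or the age distribution stop being flat.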

Is the 87% perhaps an average of the state percentages, and not properly weighted by state population? Or maybe an average across age classes not weighted by population of that class? Oh, I don't know about those, but maybe I see a bigger issue now...

In Section 4.3.1, Sweeney defines the "Number of subjects uniquely identified in a subdivision of a geographical area". But this isn't a simulation like Cook's; she's just using a binary yes/no test based on whether the subpopulation in each age class exceeds a numerical threshold:

  if population(zi, a) ≥ |Qa|, then ID_aZi = population(zi, a)
  else ID_aZi = 0.

While it's nice that it's clearly defined, I don't think this yields a "percent identifiable" that matches up with Cook's simulation, nor with any common usage of the term. Also (while I'm being picky) isn't the definition backward? If we were to go with this arbitrary definition, wouldn't we want ID_aZi to equal zero when the population exceeds the threshold? I presume the direction of the inequality is just a typo, but if the paper is using a hard threshold rather than some more rigorous approach like Cook's simulation, this seems like a major flaw in the interpretability of the results.
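To show how far a hard threshold can sit from a probabilistic count, here is a rough sketch comparing the two. It assumes |Qa| means the number of distinct (sex, birth date) combinations within age class a, which is my reading rather than something I have checked against the paper. The function names and the `n_bins = 730` value are hypothetical.

```python
def threshold_identified(population, n_bins, flipped=False):
    """Hard-threshold count. By default it follows the rule as quoted
    (everyone counts once population >= n_bins); flipped=True reverses
    the inequality (everyone counts while population <= n_bins)."""
    if flipped:
        return population if population <= n_bins else 0
    return population if population >= n_bins else 0

def expected_identified(population, n_bins):
    """Expected number of people alone in their bin when bins are
    assigned uniformly at random."""
    return population * (1 - 1 / n_bins) ** (population - 1)

n_bins = 2 * 365  # assumption: a one-year age class, 2 sexes x 365 days

for population in (100, 700, 800):
    print(
        population,
        threshold_identified(population, n_bins),
        threshold_identified(population, n_bins, flipped=True),
        round(expected_identified(population, n_bins), 1),
    )
```

With either direction of the inequality, the threshold rule jumps between "everyone" and "no one" at a single population value, while the expected count changes smoothly. Just below the threshold, the reversed rule counts every resident even though the uniform model expects well under half of them to be unique.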