by nkurz 2753 days ago

Now that the edit window is over, I finally noticed the massive error in my wording in the second-to-last sentence. Instead, let's pretend I wrote "why would we want Id_zi to equal zero when the population is less than the threshold?" It's also worth noting again that I haven't read the paper closely, and may very well be misinterpreting what it is doing.

---

But since I'm still in this edit window, I'll add an update here. I downloaded the per-zip population data from here: https://blog.splitwise.com/2013/09/18/the-2010-us-census-pop.... Then I wrote a quick Perl program (parallel to Cook's Python simulation) that uses the actual per-zip-code populations rather than a fixed average. After confirming Cook's 84% number with the fixed population, I ran it on the actual populations (but still with a flat distribution for age and sex) and got 63% uniques.
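For anyone who wants to try the same kind of experiment, here is a minimal Python sketch of the idea. It is not the Perl program described above and not Cook's script. The function names, the bin count, and the toy populations are all assumptions made up for illustration. Each person in a zip gets one of `n_bins` equally likely (dob, sex) combinations, and we count the people whose combination nobody else in their zip shares. The second function gives the closed-form expectation under the same flat assumption, as a sanity check on the simulation.

```python
import random
from collections import Counter


def fraction_unique(zip_populations, n_bins, seed=0):
    """Simulate the fraction of people whose (dob, sex) bin is unique in their zip.

    Assumes a flat distribution: every person lands in one of n_bins
    equally likely bins, independently of everyone else.
    """
    rng = random.Random(seed)
    unique = 0
    total = 0
    for pop in zip_populations:
        # Count how many people in this zip fall into each bin.
        counts = Counter(rng.randrange(n_bins) for _ in range(pop))
        # A person is unique if they are the only one in their bin.
        unique += sum(1 for c in counts.values() if c == 1)
        total += pop
    return unique / total


def expected_fraction_unique(zip_populations, n_bins):
    """Closed-form expectation under the same flat assumption.

    A given person is unique if none of the other (pop - 1) people in the
    zip picks the same bin, which happens with probability
    (1 - 1/n_bins) ** (pop - 1).
    """
    total = sum(zip_populations)
    stay_alone = 1.0 - 1.0 / n_bins
    return sum(p * stay_alone ** (p - 1) for p in zip_populations if p > 0) / total


if __name__ == "__main__":
    # Hypothetical numbers, NOT the census data or Cook's parameters:
    # 365 birthdays x 80 birth years x 2 sexes.
    n_bins = 365 * 80 * 2
    skewed = [1000, 20000, 500, 50000]       # toy per-zip populations
    flat = [sum(skewed) // len(skewed)] * 4  # same total, fixed average
    print("skewed, simulated:", fraction_unique(skewed, n_bins))
    print("skewed, expected: ", expected_fraction_unique(skewed, n_bins))
    print("flat, expected:   ", expected_fraction_unique(flat, n_bins))
```

In this toy example the skewed populations give a lower unique fraction than the fixed average with the same total, which is the same direction as the drop from 84% to 63% reported above. To run it on real data you would replace `skewed` with the per-zip populations from the census file.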

Presumably this number would drop somewhat further with actual age distributions, but I don't know exactly how far. My current belief is that Sweeney's paper does a good job of calling attention to the fact that the risk of identification is high, but that its methodology and exact numbers should not be trusted. The actual percentage of Americans identifiable by (zip, dob, sex) is large, but something less than 63%. It might be interesting to run the simulation with actual age bracket data, but I didn't find this in any easy-to-download format, so I think I'll stop here.