by shoeboxam 1556 days ago
Consider Alice's census response as a row in a table of responses. You can reconstruct every attribute/column for which a sufficiently large number of summary statistics are released.

Names are not part of the statistics the Census releases, so you won't be able to reconstruct the name directly. Instead, make a fingerprint out of some of the reconstructed attributes and run a database join against another dataset that has names. Each matched row now carries a name plus the remaining reconstructed attributes that were not part of your fingerprint.
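A minimal sketch of that join step, in plain Python. The column names (`block`, `age`, `sex`, `race`), the records, and the helper `link_by_fingerprint` are all made up for illustration; a real linkage would have to deal with noisy and non-unique matches.

```python
from collections import defaultdict

# Hypothetical quasi-identifier columns used as the fingerprint.
FINGERPRINT = ("block", "age", "sex")

def link_by_fingerprint(reconstructed, named, keys=FINGERPRINT):
    """Inner-join reconstructed rows to named rows on the fingerprint columns.

    A reconstructed row is linked only when its fingerprint matches
    exactly one record in the named dataset.
    """
    # Index the named dataset by fingerprint.
    index = defaultdict(list)
    for person in named:
        index[tuple(person[k] for k in keys)].append(person)

    linked = []
    for row in reconstructed:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:
            # Attach the name; the row keeps its non-fingerprint attributes.
            linked.append({**row, "name": matches[0]["name"]})
    return linked

# Made-up reconstructed rows (no names) and a made-up named dataset.
reconstructed = [
    {"block": "1001", "age": 34, "sex": "F", "race": "A"},
    {"block": "1001", "age": 61, "sex": "M", "race": "B"},
]
named = [
    {"name": "Alice", "block": "1001", "age": 34, "sex": "F"},
    {"name": "Bob", "block": "1002", "age": 61, "sex": "M"},
]

linked = link_by_fingerprint(reconstructed, named)
print(linked)
```

Here `race` plays the role of the attribute that was not in the fingerprint: after the join it sits next to a name.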

I'm grateful no one has attacked the 2010 decennial census and shared the results, or done the same to the ACS, which asks considerably more questions. If it seems far-fetched... well, you only need one person to do it, there's an existence proof, and the attack is basically just a convex optimization problem.

As to what can specifically be in those columns, it's pretty narrow for the decennial, but the ACS is broader. Check out the ACS summary data: https://data.census.gov/cedsci/table?q=United%20States

Extra: I've written some reconstruction attacks like this myself. One approach is to find the least-squares solution, i.e. the x that minimizes ||Ax - b||. Let b be a vector of Census statistics and A the query matrix that produced them. An attacker has both. Solve for x, which is a column in the dataset. If b is long enough that A has full column rank, the system is fully determined and you can solve for x exactly. In practice, b can be much shorter than x and the system will still reconstruct x with high accuracy. Repeat for each column. There are more efficient approaches, like SAT solvers, for large systems.
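A toy sketch of the least-squares version using NumPy. Everything here is an assumption for illustration: the secret column is binary, the queries are random subset counts, and the released counts are exact (no noise or suppression). `reconstruct_column` is a made-up helper, not anything from an actual attack.

```python
import numpy as np

def reconstruct_column(A, b):
    """Estimate a binary column x from released statistics b = A @ x."""
    # Least-squares solution of min ||Ax - b||.
    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Assumes the attribute is 0/1, so round to the nearest valid value.
    return np.clip(np.rint(x_hat), 0, 1).astype(int)

rng = np.random.default_rng(0)
n = 20                                   # number of respondents (rows in the table)
x_true = rng.integers(0, 2, size=n)      # made-up secret binary attribute

# Each row of A picks a random subset of respondents;
# the matching entry of b is the count of 1s in that subset.
A = rng.integers(0, 2, size=(60, n)).astype(float)
b = A @ x_true

x_rec = reconstruct_column(A, b)
print("fraction of entries recovered:", (x_rec == x_true).mean())
```

With 60 queries over 20 unknowns the system is overdetermined, so this is the easy case. Recovering x from fewer queries than unknowns generally needs the constraints on x (binary or integer values) built into the solver, which plain least squares does not do.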

Happy to talk more about the general approaches used to privatize census data if anyone is curious. I don't work on the Census mechanisms, but I do privacy research and am familiar with DP hierarchical histogram mechanisms.

See also Abowd's summary of the attack: https://blogs.cornell.edu/abowd/special-materials/245-2/