by drostie 5224 days ago
Maybe, but there would also have to be passes through the data afterward to link identities. Hashing is dangerous because browsers are living, dynamic beasts. When someone updates their browser, their user agent changes, and you'll want to keep their new identity as an extension of their old one. Not to mention that people use multiple browsers. So there's going to be a vital step of "linking the new identity to old ones", which can happen on a separate, dedicated thread, but you'll need to keep the raw data to do it. You'll probably truncate ultralarge fields and then gzip them, rather than just hashing them.
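A minimal sketch of the truncate-then-gzip idea, assuming the fingerprint arrives as a flat mapping of field names to strings. The function names, the JSON encoding, and the 1024-character cap are illustrative assumptions, not anything the comment specifies:

```python
import gzip
import json


def pack_fingerprint(fields: dict[str, str], max_chars: int = 1024) -> bytes:
    """Truncate each field to max_chars, then gzip the whole record.

    max_chars is a hypothetical cap; tune it to whatever "ultralarge" means
    for your data.
    """
    truncated = {name: value[:max_chars] for name, value in fields.items()}
    payload = json.dumps(truncated, sort_keys=True, separators=(",", ":"))
    # mtime=0 keeps the output byte-for-byte stable for identical input.
    return gzip.compress(payload.encode("utf-8"), mtime=0)


def unpack_fingerprint(blob: bytes) -> dict[str, str]:
    """Recover the (truncated) fields so identities can be compared later."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

Unlike a hash, the stored blob can be unpacked again, which is what a later identity-linking pass needs in order to compare an old record with a new one field by field.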

One interesting thought: how much space would you need to pull this off? Chromium generates 12 KB of data, which gzips to about 3 KB; Firefox generates 5 KB, which gzips to a little over 1 KB. Truncate-then-gzip could keep perhaps 0-4 KB per person. Assume your average user takes ~2 KB. That's still rather a lot compared with what you can do with counters, which take 8 bytes or so to store. If you wanted to keep your database under 2 GB, you could only handle a million people, not hundreds of millions. So it would really be a big distributed project to link identities as they evolve over time. I imagine that's one huge factor in using tracking cookies: they're the lazy way to scale.
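The sizing arithmetic is easy to check directly. This is a back-of-the-envelope sketch using decimal units and the ~2 KB and 8-byte figures from the comment; the 300 million user count is just an illustrative stand-in for "hundreds of millions":

```python
KB = 1000
GB = 1000 ** 3

BYTES_PER_FINGERPRINT = 2 * KB  # the ~2 KB average assumed above
BYTES_PER_COUNTER = 8           # a cookie-style counter ID


def users_that_fit(budget_bytes: int, bytes_per_user: int) -> int:
    """How many users fit in a given storage budget."""
    return budget_bytes // bytes_per_user


def storage_needed(users: int, bytes_per_user: int) -> int:
    """Total bytes needed to store one record per user."""
    return users * bytes_per_user


print(users_that_fit(2 * GB, BYTES_PER_FINGERPRINT))                # 1000000
print(storage_needed(300_000_000, BYTES_PER_FINGERPRINT) / GB)      # 600.0
print(storage_needed(300_000_000, BYTES_PER_COUNTER) / GB)          # 2.4
```

At these figures, fingerprints cost 250 times as much storage per user as counters, before counting any indexes or history kept for linking.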

2 comments

It reminds me of Latanya Sweeney's work from 2000, which used 1990 census data to show that 87% of the US population can be uniquely identified by just their gender, 5-digit ZIP code, and full date of birth.

An interesting project might be to create a database with a table that uses the user-agent hash as its primary key, and to associate each identity in the user table with a number of these user-agent hashes.
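A rough sketch of that layout using SQLite, where one user row can own many user-agent hashes. The table names, column names, and the choice of SHA-256 are all assumptions made for illustration:

```python
import hashlib
import sqlite3


def ua_hash(user_agent: str) -> str:
    """Hash the user-agent string; SHA-256 is an arbitrary choice here."""
    return hashlib.sha256(user_agent.encode("utf-8")).hexdigest()


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS ua_hashes (
            hash    TEXT PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id)
        );
    """)
    return conn


def new_user(conn: sqlite3.Connection) -> int:
    return conn.execute("INSERT INTO users DEFAULT VALUES").lastrowid


def link(conn: sqlite3.Connection, user_id: int, user_agent: str) -> None:
    """Attach a (possibly new) user-agent hash to an existing identity."""
    conn.execute(
        "INSERT OR REPLACE INTO ua_hashes (hash, user_id) VALUES (?, ?)",
        (ua_hash(user_agent), user_id),
    )


def lookup(conn: sqlite3.Connection, user_agent: str):
    """Return the user id owning this user agent, or None if unseen."""
    row = conn.execute(
        "SELECT user_id FROM ua_hashes WHERE hash = ?", (ua_hash(user_agent),)
    ).fetchone()
    return row[0] if row else None
```

After a browser update, calling `link` with the same user id and the new user-agent string keeps both hashes pointing at one identity. Deciding that the new string belongs to the old user is the hard part, and this sketch does not attempt it.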

You could do much better than gzip with a custom compression scheme like the Mailinator guy talked about recently.
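One simple step in that direction (not the Mailinator scheme itself, just an illustration) is to prime the compressor with a shared dictionary of substrings that show up in most records, using zlib's preset-dictionary support. The dictionary contents below are made up for the example:

```python
import zlib

# Hypothetical shared dictionary: fragments common to many user-agent strings.
SHARED_DICT = (
    b"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    b"(KHTML, like Gecko) Chrome/ Safari/537.36 Gecko/20100101 Firefox/"
)


def compress_with_dict(data: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Deflate data, letting it reference substrings of the preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Inverse of compress_with_dict; requires the exact same dictionary."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()
```

Short records benefit the most, since plain gzip has no earlier data to refer back to. The cost is that the dictionary becomes part of the storage format: changing it makes previously stored blobs unreadable unless the old dictionary is kept around.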

http://news.ycombinator.com/item?id=3617074