Hacker News new | ask | show | jobs
by kleene_op 1769 days ago
>256 topics would be ceil(log2(256)) = 8 bits of entropy

Unless several topics can be assigned to a person (which seems to be implied in the article), in which case that's 256 bits of entropy available to classify each person.

>As a reminder, there are ~ 10 billion people on earth, so if you have 34 bits of entropy or so, you can uniquely identify each person.

Yeah, well theoretically you could. But that assumes that browsers are able to extract and balance some very arbitrary and very specific information from the browsing habits of all people on earth in a perfect decision tree.

In practice, lots of browsing habits overlap, making this decision tree far less discriminating and powerful than the theoretically optimal one.

Though I think you are absolutely correct that in practice the number of bits to build up a classifier able to uniquely classify each person must be pretty low. Maybe a few hundreds.

That may very well be possible with those 256 topics mentioned in that article.

Also I don't understand the difference between cohorts and topics, apart from the fact that topic are less numerous and can have appealing names?

1 comments

> Unless several topics can be assigned to a person (which seems to be implied in the article), in which case that's 256 bits of entropy available to classify each person.

Good catch, forgot this was a bit-vector not a single key.

> Yeah, well theoretically you could. But that assumes that browsers are able to extract and balance some very arbitrary and very specific information from the browsing habits of all people on earth in a perfect decision tree.

Not really, people have found in the past that combinations of user agent, screen resolution, installed fonts, installed extensions, and things of that sort can come very close to uniquely identifying individual people.

> Though I think you are absolutely correct that in practice the number of bits to build up a classifier able to uniquely classify each person must be pretty low. Maybe a few hundreds.

Exactly. It might not narrow it down to one person, but perhaps a relatively small pool.

I imagine that advertisers wouldn't have access to the entire bit vector.

Google would also have limit the number of bits an advertiser has access to.