Hacker News new | ask | show | jobs
by charlesdaniels 1769 days ago
At the end of the day, even if you assume good faith on Google's part (which I think is quite a leap), causing the user to present more entropy to the site will make them easier to fingerprint.

256 topics would be ceil(log2(256)) = 8 bits of entropy

30,000 topics would be ceil(log2(30000) = 15 bits of entropy

As a reminder, there are ~ 10 billion people on earth, so if you have 34 bits of entropy or so, you can uniquely identify each person.

So really, the way to think of this as "Google considers making FLoC 20% less effective at fingerprinting users", and that's not even considering other sources of entropy, like user agent or screen size.

5 comments

> and that's not even considering other sources of entropy, like user agent or screen size.

As a reminder: Chrome sends 16 bits of x-client-data with every http request aimed at Google servers. So they already have half the bits they need to uniquely identify your system without FLoC.

The X-Client-Data header is only for evaluating the effect of changes to Chrome, not ad targeting: https://www.google.com/chrome/privacy/whitepaper.html#variat...

Earlier comment with more details: https://news.ycombinator.com/item?id=27367482

(Disclosure: I work on ads at Google, speaking only for myself)

>The X-Client-Data header is only for evaluating the effect of changes to Chrome, not ad targeting

Just like the 2fa phone numbers that people to facebook were for "security purposes" only, but later turned out they were using it for ad targeting?

https://techcrunch.com/2018/09/27/yes-facebook-is-using-your...

Saying "well they could be lying" kinda makes the whole discussion moot doesn't it? Why even bother talking about FloC because they could be lying about that too.
Because neither Facebook nor Google is a single monolithic decision-maker that understands what it itself is doing. Instead, they're fragmented organizations with many different groups with competing interests and goals within them.

More concretely, I think it's easy to believe that:

- The Facebook software developers and product managers who originally built and promoted phone 2FA were being earnest when they said the data would never be used for advertising.

- Some number of years later, someone elsewhere in the organization successfully got themselves access to that information without the knowledge/approval of the first group of people--who in all likelihood don't even work at Facebook anymore--and broke that original promise.

Throwing your hands up in the air and crying "well if they're lying, then all is for naught!" ignores the fact that large organizations act in complex ways, and even if you assume good faith on behalf of the current set of actors, you still need to push for systems which remain ethical and safe if some future set of actors turns out to be complete scumbags.

Irrespective of whether they're telling the truth or lying, saying Chrome sends 16 bits of x-client-data that can be used to identify you means Chrome sends 16 bits of x-client-data that can be used to identify you.
Exactly. The protocols need to not depend on “because I said so” or “pinky swear”.

This is the same problem with Apple’s new SpywareKit.

FLoC is open source in Chromium. They're not lying about that. What they do with Google-specific information originating from Chrome is where skepticism applies.
It's understandable that people have concerns about a HTTP header that could be used as part of a fingerprinting system.

Is there an option in Chrome to opt-out of this data being sent to Google?

I don't really understand this mindset. Google controls Chrome. They openly track every page you visit and show it to you at https://myactivity.google.com/ .

It's possible that Google is tracking you with FLoC or with extra HTTP headers or whatever. But they're also openly tracking you all the time anyway because you use Chrome. If you don't trust them to use the data they collect responsibly, don't use Chrome. (I'm not saying you shouldn't pressure Chrome to collect less data, I'm saying it doesn't make any sense to theorize about secret HTTP header fingerprinting operations when they're making literally no effort to hide the much bigger data collection operation right in front of you.)

>They openly track every page you visit and show it to you at https://myactivity.google.com/ .

what if I'm not signed into chrome, or have that feature disabled?

My guess is they are still tracking you. Kind of like how fb creates shadow profiles for people who don't have an account. So they do all the same tracking, you just don't get to see it in a nice dashboard!
>256 topics would be ceil(log2(256)) = 8 bits of entropy

Unless several topics can be assigned to a person (which seems to be implied in the article), in which case that's 256 bits of entropy available to classify each person.

>As a reminder, there are ~ 10 billion people on earth, so if you have 34 bits of entropy or so, you can uniquely identify each person.

Yeah, well theoretically you could. But that assumes that browsers are able to extract and balance some very arbitrary and very specific information from the browsing habits of all people on earth in a perfect decision tree.

In practice, lots of browsing habits overlap, making this decision tree far less discriminating and powerful than the theoretically optimal one.

Though I think you are absolutely correct that in practice the number of bits to build up a classifier able to uniquely classify each person must be pretty low. Maybe a few hundreds.

That may very well be possible with those 256 topics mentioned in that article.

Also I don't understand the difference between cohorts and topics, apart from the fact that topic are less numerous and can have appealing names?

> Unless several topics can be assigned to a person (which seems to be implied in the article), in which case that's 256 bits of entropy available to classify each person.

Good catch, forgot this was a bit-vector not a single key.

> Yeah, well theoretically you could. But that assumes that browsers are able to extract and balance some very arbitrary and very specific information from the browsing habits of all people on earth in a perfect decision tree.

Not really, people have found in the past that combinations of user agent, screen resolution, installed fonts, installed extensions, and things of that sort can come very close to uniquely identifying individual people.

> Though I think you are absolutely correct that in practice the number of bits to build up a classifier able to uniquely classify each person must be pretty low. Maybe a few hundreds.

Exactly. It might not narrow it down to one person, but perhaps a relatively small pool.

I imagine that advertisers wouldn't have access to the entire bit vector.

Google would also have limit the number of bits an advertiser has access to.

I'm not exactly sure what you're doing with your math there, but I think you probably should include what you think is the current entropy of your browsing sessions...

Considering you're already aware of screen size and user agent, and other forms of fingerprinting, you should probably realize that in the pre-FLoC world, you're likely already 100% identified by numerous ad networks.

While your general assertion is true (log2(10billion) ceil'd is indeed 34) it is also misleading. For it assumes that that your identifier will be almost completely uniformly distributed. That is a very hard to achieve, in fact every team that's trying to track users is effectively trying to solve this problem.
Where can I read an introduction to the concept of "entropy" in the sense that you're using it?

I understand it's an information-theoretical concept, and also understand it's somehow related to randomness, but I'm not sure exactly how, and I would like to have a more precise understanding.

I'm going to copy something I sent to a 13 year old to explain entropy in simple terms. It came up when we were talking about encryption. Reading forwards goes from dense/mathematical to conceptual; reading sections in reverse order does the opposite. This probably won't be useful to you but I have found it useful in other situations.

N bits of entropy refers to 2^n possible possible states.

Cryptanalysis:

AES-128 has a key size of 128 bits, so there are 2^128 possible AES-128 keys. A brute-force attack capable of testing 2^128 keys can break any AES-128 key with certainty.

Fingerprinting:

If a website measures your "uniqueness", saying "one in over 14 thousand people" isn't a great way to measure uniqueness because that number changes exponentially. Since we're dealing possible states, i.e. possible combinations of screen size, user-agent, etc., we instead take the base-2 logarithm of this to get a count of entropy bits (~13.8 bits).

Thermal physics:

The second law of thermodynamics states that spontaneous changes in a system should move from a low- to a high-entropy state. Hot particles are far apart and moving a lot; there are many possible states. Cold particles are moving around less and can't change as easily; there are fewer possible states. Heat cannot move from cold things to hot things on its own, but it can move from hot things to cold things. Think of balls on a billiards table moving apart rather than together.

Entropy of the whole universe is perpetually on the rise. In an unimaginably long time, the most popular understanding is that particles will all be so far apart that they'll never interact. The universe will look kind of like white noise. And endless sea of random-like movement, where everything adds up to nothing, everywhere and forever.

> A brute-force attack capable of testing 2^128 keys can break any AES-128 key with certainty.

One minor caveat: You have to be able to recognize when you've found the right key. If the message is short (less than the key size) then it is likely that there are multiple keys that can decode the ciphertext to a plausible message and you have no way to know which one was correct. This is why an ideal One-Time Pad is considered unbreakable even by brute force: For any possible message of size less than or equal to the ciphertext there exists a key which will decode the ciphertext into that message.

This is a wonderful overview — thank you for writing it! It really helps navigate some of the more dense mathematical introductions out there.
Assuming light background in introductory probability, try the first 4 pages of this: https://web.stanford.edu/~montanar/RESEARCH/BOOK/partA.pdf
Spot on, thank you kindly!