by solso 1927 days ago
It is trivial to de-anonymize if records are linkable, which is the case you mention from Dark Data at DEFCON25. Another famous case was the de-anonymization of the Netflix data set.

However, you are assuming that HumanWeb data collection is record-linkable, which is not the case, precisely to avoid this attack.

If what is being collected is linkable, e.g. (user_id, url_1), ... (user_id, url_n), then no matter how you anonymize user_id, it will eventually leak. A single url containing personally identifiable information, e.g. a username, will compromise the whole session, no matter how sophisticated the user_id generation is. The real problem, privacy-wise, is the fact that records can be linked to the same origin: an attacker (or the collector) has the ability to know if two records have the same origin.
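
To make the linkability problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the salted-hash `pseudonymize`, the `/profile/` url pattern, and the example urls are assumptions, not anything HumanWeb or the cases above actually use.

```python
import hashlib
from collections import defaultdict


def pseudonymize(user_id: str, salt: str = "secret-salt") -> str:
    """Hypothetical 'anonymization': a salted hash, stable per user."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:12]


def link_sessions(records):
    """Group urls by pseudonymous id; this is all 'linkable' means."""
    sessions = defaultdict(list)
    for pseudo_id, url in records:
        sessions[pseudo_id].append(url)
    return dict(sessions)


def deanonymize(records, marker="/profile/"):
    """Return {leaked_username: every url sharing that pseudonymous id}."""
    exposed = {}
    for urls in link_sessions(records).values():
        for url in urls:
            if marker in url:
                # Assumed url shape: .../profile/<username>[/...]
                username = url.split(marker, 1)[1].split("/", 1)[0]
                exposed[username] = urls
                break
    return exposed


records = [
    (pseudonymize("alice"), "https://example.com/health/condition-x"),
    (pseudonymize("alice"), "https://social.example/profile/alice_w"),
    (pseudonymize("bob"), "https://example.com/news"),
]
exposed = deanonymize(records)
```

The attacker never reverses the hash. One url that names the user is enough, because the shared id ties every other url in the session to it.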

The anonymization of HumanWeb, however, ensures that linkability across data points is not present. Hence, an attacker cannot know if two records come from the same origin. As a consequence, even if one url gives away user data, for instance a username, it would not compromise all the urls sent by that person.
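
As a rough illustration of the unlinkable case (a sketch of the property only, not of how HumanWeb implements it), assume each record arrives with no identifier at all. The helper names and urls are hypothetical, and the sketch ignores network-level and timing side channels, which a real system would also have to handle.

```python
import random


def strip_origin(records, seed=0):
    """Drop the user field and shuffle, so arrival order carries no origin."""
    urls = [url for _user, url in records]
    random.Random(seed).shuffle(urls)
    return urls


def attributable_urls(urls, marker="/profile/"):
    """Urls the attacker can tie to a person: only the self-identifying ones."""
    return [url for url in urls if marker in url]


records = [
    ("alice", "https://example.com/health/condition-x"),
    ("alice", "https://social.example/profile/alice_w"),
    ("bob", "https://example.com/news"),
]
received = strip_origin(records)
leaked = attributable_urls(received)
```

Here the url containing the username still leaks, but it stands alone: nothing in `received` tells the attacker which other urls came from the same person.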

If you are interested in more details I recommend this article: https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

[Disclaimer: I'm one of the authors]

1 comments

I still see a lot of ways in which users could be de-anonymized: sometimes a single URL is already sufficient, and side channels like the quorum mechanism might leak information as well. Maybe it's really anonymous, but personally I don't trust any mechanism that doesn't have a statistical anonymity guarantee, differential privacy being the preferred one as it's the only anonymity model that hasn't been broken yet.

Anyway, it's great that Cliqz did this work and I don't want to diminish it. I'm just very cautious when companies claim they're only collecting anonymous data; there have been too many cases in which such promises were broken.