It is trivial to de-anonymize data if records are linkable, which is the case in the Dark Data talk at DEF CON 25 that you mention. Another famous case was the de-anonymization of the Netflix data set. However, you are assuming that HumanWeb data collection is record-linkable, which is not the case, precisely to avoid this attack.

If what is being collected is linkable, e.g. (user_id, url_1), ..., (user_id, url_n), then no matter how you anonymize user_id, it will eventually leak. A single URL containing personally identifiable information, e.g. a username, will compromise the whole session, no matter how sophisticated the user_id generation is.

The real problem, privacy-wise, is that records can be linked to the same origin: an attacker (or the collector) can tell whether two records have the same origin.

The anonymization of HumanWeb, however, ensures that there is no linkability across data points. Hence, an attacker cannot know whether two records come from the same origin. As a consequence, even if one URL gives away user data, for instance a username, it does not compromise all the other URLs sent by that person.

If you are interested in more details I recommend this article: https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i... [Disclaimer: I'm one of the authors]
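To make the linkability point concrete, here is a minimal Python sketch. It is not HumanWeb's actual implementation. The record layout, the `/users/` URL pattern, the example URLs, and both function names are assumptions made up for illustration. It only shows why a shared identifier lets one identifying URL expose everything sent under it, while records without a shared identifier expose only themselves.

```python
from collections import defaultdict

# Hypothetical records: (pseudonymous user_id, url). All values are invented.
linkable_records = [
    ("a91f", "https://example.com/search?q=knee+pain"),
    ("a91f", "https://example.com/users/jdoe/settings"),  # leaks a username
    ("a91f", "https://example.com/clinic/appointments"),
    ("77c2", "https://example.com/news"),
]


def extract_username(url, marker="/users/"):
    """Return the path segment after the marker, or None if it is absent."""
    if marker not in url:
        return None
    return url.split(marker, 1)[1].split("/", 1)[0]


def exposed_with_linkage(records):
    """Group by user_id: one identifying URL exposes the whole group."""
    by_origin = defaultdict(list)
    for user_id, url in records:
        by_origin[user_id].append(url)

    exposed = {}
    for urls in by_origin.values():
        for url in urls:
            username = extract_username(url)
            if username is not None:
                exposed[username] = urls  # every URL from the same origin
    return exposed


def exposed_without_linkage(urls):
    """No shared id to group by: an identifying URL exposes only itself."""
    exposed = {}
    for url in urls:
        username = extract_username(url)
        if username is not None:
            exposed[username] = [url]
    return exposed


# The same URLs, sent as independent messages with no common identifier.
unlinkable_records = [url for _, url in linkable_records]

if __name__ == "__main__":
    print(exposed_with_linkage(linkable_records))
    print(exposed_without_linkage(unlinkable_records))
```

In the linkable case the username ends up attached to all three URLs that share its user_id. In the unlinkable case it is attached only to the single URL that contained it.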
Anyway, it's great that Cliqz did this work and I don't want to diminish it. I'm just very cautious when companies claim they're only collecting anonymous data; there have been too many cases in which such promises were broken.