Hacker News new | ask | show | jobs
by kjeetgill 2441 days ago
> Unless the requested hashes are saved for every single user,

I don't know why or what part you think this is that hard. Do you think a map from User -> set [Requested URL Hashes] is hard to build? Or that building the URL Hash -> set [possible domains] is hard?

Maybe I'm missing a piece of this.

Building something simple to start guessing domains visited seems pretty easy. If a user has 10 URL hashes and the same domains show up in each hashes' possible domains you're probably requesting pages on that domain. If you're lucky and all the pages from a domain fall into a single hash, all it takes is two or 3 hashes from known outbound links to show up to confirm this.

It's not foolproof but hardly infeasible? Or maybe I don't fully understand the algorithm.

1 comments

You’re ignoring that it’s not the full hash, just the prefix. The prefix could potentially match millions of URLs many of which would be duplicated with random pages from other URLs. You would need a very, very extensive model of every single IP that was mapped to each person and that would not be trivial. That’s exactly why the actual matching is done on the client and not somewhere else.
> Unless the requested hashes are saved for every single user,

It's 4 bytes by request. Google is keeping my whole history, much more than 4 bytes by request, and they do it for advertisers. I have no trouble believing a company partially owned by the government could afford 4 bytes by request.

Let say it's a billion address for each 4 bytes. You do 2 requests, how much of them will be on both list? Let's be generous and say half! You would only have to visit 30 uniques pages on that URL to find the domain. How often do you go on 30 different pages of a website? I feel it's quite regularly.

That's for a billion matches for each 4 bytes prefix, in reality it would be much less than that and there would certainly be much less than 50% matches between each prefix.

You can even do it in reverse even more easily. Go to a bunch of forum that you want to silence the users. Find their URL which allow to post on it. Get the 4 bytes prefix. Now you got a bunch of timestamped comments with username, and they are most likely pretty unique in the Tencent database. Now you found which IP is for which username.

Is it a hash of the prefix of the url? Or a prefix of the hash? Either way, it's deterministic so it's just a more collisiony hash.

On short enough time frames an IP is often a good enough approximation for a person.