by dpkonofa 2437 days ago
You’re ignoring that it’s not the full hash, just the prefix. The prefix could potentially match millions of URLs, many of them random pages on unrelated sites. You would need a very extensive model mapping every single IP to a person, and that would not be trivial. That’s exactly why the actual matching is done on the client and not somewhere else.
2 comments

> Unless the requested hashes are saved for every single user,

It's 4 bytes per request. Google keeps my whole history, far more than 4 bytes per request, and they do it for advertisers. I have no trouble believing a company partially owned by the government could afford 4 bytes per request.

Let's say a billion addresses match each 4-byte prefix. You make 2 requests; how many of those addresses will be on both lists? Let's be generous and say half! You would only have to visit about 30 unique pages on that site to find the domain, since halving a billion 30 times leaves about one candidate. How often do you visit 30 different pages of a website? I feel it's quite regularly.

That's with a billion matches for each 4-byte prefix; in reality it would be far fewer than that, and the overlap between the lists for any two prefixes would certainly be much less than 50%.
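
To put a number on that, here is a rough sketch of the arithmetic. The function name and the "each extra request keeps a fixed fraction of the candidates" model are my own simplification, not anything from a real Safe Browsing implementation.

```python
def intersections_needed(candidates: int, overlap: float) -> int:
    """Count how many extra prefix observations it takes to shrink a
    candidate set to at most one entry, assuming (hypothetically) that
    each new observation keeps only `overlap` of the remaining candidates."""
    if not 0.0 < overlap < 1.0:
        raise ValueError("overlap must be strictly between 0 and 1")
    remaining = float(candidates)
    steps = 0
    while remaining > 1:
        remaining *= overlap  # intersect with the next prefix's list
        steps += 1
    return steps


# The generous case from above: a billion candidates, half survive each time.
print(intersections_needed(1_000_000_000, 0.5))
# A smaller overlap narrows things down in fewer requests.
print(intersections_needed(1_000_000_000, 0.25))
```

With a realistic overlap well under 50%, the count drops fast, which is the point: a handful of page views on one site is enough.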

You can do it in reverse even more easily. Go to a bunch of forums whose users you want to silence. Find the URLs used to post on them. Get their 4-byte prefixes. Now you have a bunch of timestamped comments with usernames, and those prefixes are most likely pretty unique in the Tencent database. Match the timestamps and you have found which IP belongs to which username.

Is it a hash of a prefix of the URL? Or a prefix of the hash? Either way, it's deterministic so it's just a more collisiony hash.
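
A minimal sketch of the "prefix of the hash" reading, assuming SHA-256 over the raw URL string and a 4-byte prefix. Real clients canonicalize the URL first; that step is skipped here, and `hash_prefix` and `find_collision` are made-up helper names. The same URL always gives the same prefix, and shortening the prefix just makes collisions more common.

```python
import hashlib


def hash_prefix(url: str, length: int = 4) -> bytes:
    """Return the first `length` bytes of SHA-256(url).
    Assumption: no URL canonicalization, unlike a real client."""
    return hashlib.sha256(url.encode("utf-8")).digest()[:length]


def find_collision(urls, length):
    """Return the first pair of distinct URLs sharing a prefix, or None."""
    seen = {}
    for url in urls:
        prefix = hash_prefix(url, length)
        if prefix in seen:
            return seen[prefix], url
        seen[prefix] = url
    return None


# Deterministic: the same URL maps to the same 4-byte prefix every time.
print(hash_prefix("https://example.com/") == hash_prefix("https://example.com/"))

# With a 1-byte prefix there are only 256 possible values, so 257 distinct
# URLs are guaranteed to contain at least one collision (pigeonhole).
urls = [f"https://example.com/page/{i}" for i in range(257)]
pair = find_collision(urls, length=1)
print(pair)
```

So the server never sees the URL itself, but it does see a stable identifier for it, which is all the intersection trick needs.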

On short enough time frames an IP is often a good enough approximation for a person.