by jkp56 2437 days ago
These hashes also reveal domain names. Most users visit many URLs on a small set of domains. If a user requests 1000 hashes that can all map to Reddit, it's very likely that the user is indeed reading Reddit. Another way to look at it: if the same person appears in the crowd in hundreds of photos, it's trivial to notice that there is something special about this person, even though in every single photo the person was k-anonymous.

Except that these hashes likely won't reveal domain names.

For example, the hash for reddit.com/r/gifs would be different from the hash for reddit.com/r/funny, so their prefixes would be different too. Unless the requested hashes are saved for every single user, it would be way too computationally expensive for them to get anything useful out of that. Not to mention that any given prefix would match thousands of URLs. Narrowing down which domains those URLs are rooted on would be incredibly hard.
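To make that concrete, here is a minimal sketch of what a 4-byte prefix of a SHA-256 hash looks like for those two paths. The helper name `hash_prefix` is made up for illustration, and the sketch skips the URL canonicalization a real Safe Browsing client performs first.

```python
import hashlib

def hash_prefix(expression: str, length: int = 4) -> bytes:
    """Return the first `length` bytes of SHA-256 over a URL expression."""
    return hashlib.sha256(expression.encode("utf-8")).digest()[:length]

gifs = hash_prefix("reddit.com/r/gifs")
funny = hash_prefix("reddit.com/r/funny")

# Each prefix is 4 bytes (8 hex characters); two different inputs will
# almost certainly produce different prefixes.
print(gifs.hex(), funny.hex())
```

The prefix is deterministic, so the same expression always yields the same 4 bytes.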

> Unless the requested hashes are saved for every single user,

I don't see which part of this you think is hard. Do you think a map from User -> set [Requested URL Hashes] is hard to build? Or that building the URL Hash -> set [possible domains] is hard?

Maybe I'm missing a piece of this.

Building something simple to start guessing domains visited seems pretty easy. If a user has 10 URL hashes and the same domain shows up in each hash's possible domains, they're probably requesting pages on that domain. If you're lucky and all the pages from a domain fall into a single hash, all it takes is two or three hashes from known outbound links to show up to confirm this.
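A rough sketch of that idea, assuming the two maps already exist. The function name `guess_domains`, the `min_share` threshold, and the toy prefixes below are all invented for illustration, not taken from any real system.

```python
from collections import Counter

def guess_domains(requested_prefixes, prefix_to_domains, min_share=0.8):
    """Return domains that are possible for at least `min_share` of a user's prefixes."""
    counts = Counter()
    for prefix in requested_prefixes:
        for domain in prefix_to_domains.get(prefix, set()):
            counts[domain] += 1
    total = len(requested_prefixes)
    return [d for d, c in counts.most_common() if total and c / total >= min_share]

# Toy data: every prefix is ambiguous on its own, but one domain recurs.
prefix_to_domains = {
    "aa11": {"reddit.com", "example.org"},
    "bb22": {"reddit.com", "example.net"},
    "cc33": {"reddit.com"},
}
print(guess_domains(["aa11", "bb22", "cc33"], prefix_to_domains))
```

With real data the candidate sets would be far larger, so how well this works depends entirely on how much they overlap.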

It's not foolproof but hardly infeasible? Or maybe I don't fully understand the algorithm.

You’re ignoring that it’s not the full hash, just the prefix. The prefix could potentially match millions of URLs, many of them random pages on unrelated domains. You would need a very, very extensive model mapping every single IP to a person, and that would not be trivial. That’s exactly why the actual matching is done on the client and not somewhere else.

> Unless the requested hashes are saved for every single user,

It's 4 bytes per request. Google keeps my whole history, far more than 4 bytes per request, and they do it for advertisers. I have no trouble believing a company partially owned by the government could afford to store 4 bytes per request.

Let's say a billion addresses match each 4-byte prefix. You make 2 requests; how many of those addresses will be on both lists? Let's be generous and say half! Each further request halves the candidates again, and 2^30 is about a billion, so you would only have to visit 30 unique pages on that domain to find it. How often do you visit 30 different pages of a website? I feel it's quite regularly.

That assumes a billion matches for each 4-byte prefix. In reality it would be far fewer, and the overlap between any two prefixes would certainly be much less than 50%.
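The back-of-the-envelope numbers can be checked with a few lines. This is only a sketch of the estimate above; `narrowing_steps` is a made-up name, and the billion candidates and 50% overlap are the generous assumptions from this comment, not measured values.

```python
def narrowing_steps(candidates: float, overlap: float) -> int:
    """Count how many times the candidate set must shrink by `overlap` to reach one."""
    if not 0 < overlap < 1:
        raise ValueError("overlap must be strictly between 0 and 1")
    steps = 0
    while candidates > 1:
        candidates *= overlap  # each new request keeps only this share
        steps += 1
    return steps

print(narrowing_steps(1e9, 0.5))   # generous case: half survive each time
print(narrowing_steps(1e9, 0.25))  # smaller overlap needs fewer requests
```

A smaller overlap between prefixes shrinks the number of page visits needed.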

You can do it in reverse even more easily. Go to a bunch of forums whose users you want to silence. Find the URLs used to post on them and compute their 4-byte prefixes. You now have a set of timestamped comments with usernames, and those prefixes are most likely pretty unique in the Tencent database. Now you know which IP belongs to which username.

Is it a hash of the prefix of the url? Or a prefix of the hash? Either way, it's deterministic so it's just a more collisiony hash.

On short enough time frames an IP is often a good enough approximation for a person.

No, you are wrong. You don't understand how Safe Browsing URL hashes are produced. Read this first: https://developers.google.com/safe-browsing/v4/urls-hashing under the section "Suffix/prefix expressions"
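For anyone who doesn't want to read the whole page, here is a rough sketch of that section as I understand it: each URL is expanded into several host-suffix/path-prefix expressions, and each expression is hashed separately. The function names are mine, and the sketch skips canonicalization and IP-address hosts. Whether a given prefix ever leaves the client depends on the local list lookup, which is not modeled here.

```python
from urllib.parse import urlsplit

def host_suffixes(host: str) -> list[str]:
    """Exact host, plus up to four suffixes taken from the last five components."""
    parts = host.split(".")
    suffixes = [host]
    start = max(1, len(parts) - 5)
    for i in range(start, len(parts) - 1):  # never go down to the bare TLD
        suffixes.append(".".join(parts[i:]))
    return suffixes

def path_prefixes(path: str, query: str) -> list[str]:
    """Exact path with and without query, then "/" and up to three directories."""
    path = path or "/"
    prefixes = [f"{path}?{query}"] if query else []
    prefixes.append(path)
    current = "/"
    prefixes.append(current)
    for component in path.split("/")[1:-1][:3]:
        current += component + "/"
        prefixes.append(current)
    return list(dict.fromkeys(prefixes))  # drop duplicates, keep order

def expressions(url: str) -> list[str]:
    parts = urlsplit(url)
    return [h + p for h in host_suffixes(parts.hostname)
            for p in path_prefixes(parts.path, parts.query)]

gifs = expressions("https://reddit.com/r/gifs")
funny = expressions("https://reddit.com/r/funny")
print(sorted(set(gifs) & set(funny)))  # expressions the two pages share
```

Under these assumptions the two subreddit URLs have different exact-path expressions but share the host-level ones.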
Yes, I do. I linked to that very same site. The prefix is the only thing that’s sent, and a separate hash is computed for each URL expression. The docs explicitly say that http://evil.com/foo is distinct from http://evil.com/foo?bar.
So any secrets communicated in the URL, e.g. via GET parameters, can potentially be brute-forced based on the hash?
If you have the computing power to brute force sha256 then there are few security precautions a user can take against you.

That opens up forging SSL certificates, signing Debian packages, and a whole slew of other things that you could pull off with more computing power than has existed in the whole of the universe.

Context: https://bitcoin.stackexchange.com/a/41842

Suppose a public PGP key is communicated in the URL.

There aren't that many PGP keys in the world. Simply trying the known keys in some prioritized order would find a lot of matches.

This way you could recognize who someone is talking to based solely on the hashes they send for 'malicious browsing' checks.

In general, anytime you know a secret in the URL comes from a low (min-) entropy distribution you can use the truncated SHA-hash as an oracle to (approximately) figure out the secret. Obviously, if the secret is a 128 bit session cookie, you won't find that by brute forcing. Heck, in that case your 32 or 48 bits of hash prefix won't even be enough to uniquely identify the session cookie.
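A toy version of that oracle, assuming the secret is a 4-digit PIN in a query parameter. The URL `example.test/reset?pin=...` and the helper names are hypothetical, and canonicalization is skipped again.

```python
import hashlib

def hash_prefix(expression: str, length: int = 4) -> bytes:
    """Return the first `length` bytes of SHA-256 over a URL expression."""
    return hashlib.sha256(expression.encode("utf-8")).digest()[:length]

def candidates_for(observed_prefix: bytes, url_template: str, guesses) -> list[str]:
    """Return every guess whose URL hashes to the observed prefix."""
    return [
        g for g in guesses
        if hash_prefix(url_template.format(secret=g), len(observed_prefix)) == observed_prefix
    ]

# The observer only sees this 4-byte prefix, never the URL itself.
observed = hash_prefix("example.test/reset?pin=4821")

# Low-entropy secret: only 10,000 possibilities to enumerate.
pins = (f"{n:04d}" for n in range(10_000))
matches = candidates_for(observed, "example.test/reset?pin={secret}", pins)
print(matches)
```

The true PIN is always among the matches; with a space this small relative to 2^32 prefixes, few or no false candidates are expected.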

When you get to a 64-bit secret (say, again, a session cookie) with a 48-bit prefix, things already get dicey. It seems possible that some state actors can already brute-force 64-bit secrets, and with a 48-bit prefix you'd get about 2^16 = 65,536 candidate session cookies. Those can actually be tried at the website to see if they work.
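The candidate count is just the bits of the secret that the prefix does not pin down. A quick sketch of that arithmetic, with `expected_candidates` as a made-up helper and a uniformly random secret assumed:

```python
def expected_candidates(secret_bits: int, prefix_bits: int) -> int:
    """Rough number of secrets expected to share one observed hash prefix."""
    return 2 ** max(secret_bits - prefix_bits, 0)

print(expected_candidates(64, 48))   # 64-bit secret, 48-bit prefix
print(expected_candidates(128, 48))  # 128-bit secret: far too many to try
```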

You're assuming the range of possible parameters is very large. This might not be the case.