Hacker News new | ask | show | jobs
by alienallys 2441 days ago
This assumes there can be many prefix matches for any given url hash.

How many times (probability) do you think a 32-bit hash prefix matches more than one 256-bit hash? Very very unlikely; it's one-to-one practically speaking.

1 comments

Nearly 100% probability. Someone else[0] already did the ballpark math and came up with ~10,000 unique URLs per 32-bit hash prefix if you use the number of pages indexed by Google. My math has it coming out to be ~10 billion per 32-bit hash prefix. I simply did: 30,000,000,000,000/(36*8), but I might be doing something wrong, happy to be corrected.

The issue is, as mentioned in that comment, with visiting a handful of pages you can get a clearer picture of similar URLs that are being returned among these hash prefixes, which can be used to build a profile of your browsing history.

[0]: https://news.ycombinator.com/item?id=21255223

Why are you doing 36*8? It look likes 26 + 10?

You got 32 bits, thus 2^32 = 4 294 967 296 possibilities

30 000 000 000 000 / 4 294 967 296 = 6984.9193

That doesn't seems to me like a 100% probability at all.

You're right, I was using 0-9A-Z, when it should have been 0-9A-F. I still need some coffee :)

So for 6,984 URLs per 32-bit hash, wouldn't that be evenly distributed since it's the result of a hashing function? Therefore we'd expect fairly close to 6,900 URLs per prefix? In what situation would you expect a 1-to-1 of 32-bit prefix to URL? Note: happy to be disproven, this is not my specialty at all.

> when it should have been 0-9A-F.

We are working in bits, use bits instead, that will avoid theses kinds of mistakes.

> In what situation would you expect a 1-to-1 of 32-bit prefix to URL?

Oh yeah sorry I misunderstood it, yeah it's pretty unlikely that you would get 1 url for a prefix(but still possible). I would have to get out my old probability books to find that out but it's not worth it, the probability would be way too tiny.

I thought it was about certainty to be able to match it with the real URL. In theory it would takes only a few page hit to be certain of the domain and thus the URL (if there's no unknown string in the URL).

> We are working in bits, use bits instead, that will avoid theses kinds of mistakes.

K.