by randomstring
869 days ago
I wonder if you could construct a hash collision for high-PageRank sites in the Google (or Bing) index. You would need to know what hash algorithm Google uses to store URLs. This assumes they hash URLs for indexing, which surely they do. MD5 and SHA-1 existed when Google was founded, but hash collisions weren't a big concern until later, IIRC. You'd want a fast algorithm, because you have to hash every URL you encounter on every page, and that adds up quickly.

The commonly cited max URL length is around 2048 characters (as far as I know that's a practical browser and server limit rather than anything the spec mandates), and I wouldn't be surprised if there are plenty of longer URLs in the wild. If you were limited to 2048 characters and a valid URL format, I suspect it would be hard, if not impossible, to build a URL with the same MD5 as an arbitrary high-ranking URL like "https://nytimes.com/". Hitting one specific target is a second-preimage attack, which is much harder than finding any two inputs that happen to collide.

But what if you just wanted to piggyback on the PageRank of any mid- to high-ranked site? Is there a URL among the top million ranked URLs whose MD5 you could collide with? I doubt Google would use a URL hash as strong and as slow as MD5. Maybe Page and Brin weren't even thinking about cryptographic hashes and just used a good mixing function.
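To make the idea concrete, here's a rough Python sketch. It is purely illustrative: nobody outside Google knows what fingerprint they actually use, so FNV-1a is just my stand-in for "a good mixing function", and `md5_64`, `find_second_preimage`, and the `https://example.com/<n>` candidate URLs are hypothetical names I made up. It brute-forces a URL whose fingerprint matches a target URL's, but only on a truncated fingerprint, since that's all brute force can realistically reach.

```python
import hashlib

# Standard 64-bit FNV-1a constants.
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Fast non-cryptographic 64-bit hash (stand-in for 'a good mixing function')."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def md5_64(data: bytes) -> int:
    """First 64 bits of MD5, for comparison with the non-cryptographic hash."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def find_second_preimage(target_url, bits, fingerprint=fnv1a_64, max_tries=1_000_000):
    """Brute-force a different URL whose low `bits` bits of fingerprint match the target.

    Expected work is about 2**bits tries, so this is only practical for small `bits`.
    The candidate URL pattern is an assumption for illustration.
    """
    mask = (1 << bits) - 1
    target = fingerprint(target_url.encode()) & mask
    for i in range(max_tries):
        candidate = f"https://example.com/{i}"
        if fingerprint(candidate.encode()) & mask == target:
            return candidate
    return None  # gave up within max_tries


if __name__ == "__main__":
    url = "https://nytimes.com/"
    print(hex(fnv1a_64(url.encode())), hex(md5_64(url.encode())))
    # 16 bits is toy-sized; each extra bit doubles the expected work.
    print(find_second_preimage(url, bits=16))
```

The search cost doubles with every extra bit, so matching a specific URL on a full 64-bit fingerprint by brute force takes on the order of 2^64 tries, and a full 128-bit MD5 is far beyond that. Any real attack would need a structural weakness in the hash, not just more CPU.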