by dantiberian
4479 days ago
Nice catch. I'm not so sure about: "A simple fix will be just crawling the links without the request parameters so that we don't have to suffer."
Many links would fail or return different content if the request parameters were removed from the URL. Perhaps the crawler could use some kind of reverse bloom filter [1] to be more careful, or back off, if it receives the same content from multiple URLs. However, nothing is simple at Google scale, so there are probably issues with this approach too.
[1]: http://www.somethingsimilar.com/2012/05/21/the-opposite-of-a...
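To make the idea concrete, here is a rough sketch of what I mean, not how any real crawler works. The names (SeenCache, seen_before, duplicate_hits) are made up for illustration. It assumes the "reverse bloom filter" is a fixed-size table of slots that can forget things (false negatives) but only says "seen" on an exact match. I store a SHA-256 digest of the body rather than the body itself, so a false positive would need a digest collision.

```python
import hashlib


class SeenCache:
    """Fixed-size cache that may forget items but only reports exact matches."""

    def __init__(self, size=1024):
        # Each slot holds the digest of the last body that hashed to it.
        self.slots = [None] * size

    def seen_before(self, body: bytes) -> bool:
        digest = hashlib.sha256(body).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.slots)
        previous = self.slots[index]
        # Overwrite the slot; an older entry here is forgotten.
        self.slots[index] = digest
        return previous == digest


def duplicate_hits(pages, cache):
    """Count fetched (url, body) pairs whose body the cache remembers."""
    return sum(1 for _url, body in pages if cache.seen_before(body))


if __name__ == "__main__":
    # Hypothetical fetch results: two URLs that differ only in parameters.
    pages = [("/a?x=1", b"same page"), ("/a?x=2", b"same page")]
    print(duplicate_hits(pages, SeenCache()))
```

A crawler could track the hit count per site and slow down or skip parameter variants once it climbs. The memory stays bounded, at the cost of missing some duplicates.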