by tomjen3 640 days ago
This won't work. If you are doing an AI startup, you will want to use GoogleBot's user agent for your crawler, and that will bypass this.

Not too much of a loss, since the only quality content is already behind paywalls or on assorted wiki-style sites. In my experience, anything served with ads for commercial reasons is automatically drivel. There simply isn't a business in making it better.

Edit: updated comment to not be needlessly divisive.

1 comments

It is trivial to detect fake GoogleBot traffic (Google provides ways to validate it) and Cloudflare already does so. See for yourself:

  curl -I -H "User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/105.0.5195.102 Safari/537.36" https://www.cloudflare.com
They'll immediately flag the request as malicious and return 403 Forbidden, even if your IP address is otherwise reputable.
Now try it from a Google Cloud VM.
Pretty sure that won't work: Google lets you validate whether an IP address is used by GoogleBot specifically, not just owned by Google in general. I doubt they are foolish enough to use the same pool of IP addresses for their internal crawlers and their public cloud.

https://developers.google.com/search/docs/crawling-indexing/...
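For anyone wanting to see what that validation can look like, here is a minimal sketch of a forward-confirmed reverse DNS check: resolve the IP to a hostname, require a crawler-specific domain, then resolve the hostname back and confirm it returns the same IP. The function name `is_verified_googlebot`, the injectable `reverse_lookup`/`forward_lookup` parameters, and the choice to accept only `.googlebot.com` are my assumptions for illustration, not something taken from the linked page. Note that `socket.gethostbyname_ex` only handles IPv4.

```python
import socket

# Assumption: only hostnames under this suffix count as Googlebot's own crawlers.
GOOGLEBOT_SUFFIXES = (".googlebot.com",)


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None,
                          suffixes=GOOGLEBOT_SUFFIXES):
    """Forward-confirmed reverse DNS check for an IP claiming to be Googlebot."""
    reverse_lookup = reverse_lookup or socket.gethostbyaddr
    forward_lookup = forward_lookup or socket.gethostbyname_ex
    try:
        # Step 1: reverse (PTR) lookup, IP -> hostname.
        hostname = reverse_lookup(ip)[0].rstrip(".").lower()
    except OSError:
        return False
    # Step 2: the hostname must sit under an accepted crawler domain.
    if not hostname.endswith(suffixes):
        return False
    try:
        # Step 3: forward lookup, hostname -> list of IPv4 addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # Step 4: the forward result must contain the original IP,
    # otherwise the PTR record could simply be forged.
    return ip in addresses
```

Calling `is_verified_googlebot(request_ip)` with the defaults performs real DNS lookups; the lookup parameters exist only so the logic can be exercised offline.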

It depends on how the site has implemented it. A huge number just look for AS origination and *googleusercontent.com.
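A rough sketch of why that kind of hostname match is loose. The pattern lists, the helper `matches_any`, and the example hostnames are assumptions for illustration; in particular I am assuming a cloud VM's reverse DNS name ends in googleusercontent.com, which is the premise of the comment above.

```python
from fnmatch import fnmatchcase

# Loose: the pattern from the comment above (no dot before the domain).
LOOSE_PATTERNS = ("*googleusercontent.com",)
# Stricter (assumed): only hostnames under the crawler-specific domain.
STRICT_PATTERNS = ("*.googlebot.com",)


def matches_any(hostname, patterns):
    """Return True if the normalized hostname matches any glob pattern."""
    name = hostname.rstrip(".").lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Hypothetical hostnames used only for illustration.
cloud_vm = "3.2.1.34.bc.googleusercontent.com"
lookalike = "evilgoogleusercontent.com"
crawler = "crawl-66-249-66-1.googlebot.com"

print(matches_any(cloud_vm, LOOSE_PATTERNS))    # True: a rented VM passes
print(matches_any(lookalike, LOOSE_PATTERNS))   # True: no dot anchors the domain
print(matches_any(cloud_vm, STRICT_PATTERNS))   # False
print(matches_any(crawler, STRICT_PATTERNS))    # True
```

A hostname match alone still trusts whoever controls the PTR record, so it only means much when paired with a forward lookup that returns the original IP.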