HN Mirror

by tomjen3 640 days ago

This won't work. If you are doing an AI startup, you will want your crawler to present itself as GoogleBot, and that will bypass this.

Not too much of a loss, since the only quality content is already behind paywalls, or on diverse wikistyle sites. Anything served with ads for commercial reasons is automatically drivel, based on my experience. There simply isn't a business in making it better.

Edit: updated comment to not be needlessly divisive.

jsheard 640 days ago

It is trivial to detect fake GoogleBot traffic (Google provides ways to validate it) and Cloudflare already does so. See for yourself:

  curl -I -H "User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/105.0.5195.102 Safari/537.36" https://www.cloudflare.com

They'll immediately flag the request as malicious and return 403 Forbidden, even if your IP address is otherwise reputable.
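For context, the validation Google documents is a forward-confirmed reverse DNS lookup: resolve the client IP to a hostname, check that the hostname sits under a Google crawler domain, then resolve that hostname forward and confirm it maps back to the same IP. Below is a minimal sketch of that idea, not Cloudflare's actual implementation. The function names and the injectable `reverse`/`forward` parameters are assumptions made for illustration, and the suffix list is deliberately narrowed to `googlebot.com`.

```python
import socket

# Assumption: only the googlebot.com suffix is accepted here. Google documents
# other domains for other fetcher types; they are left out of this sketch.
CRAWLER_SUFFIXES = (".googlebot.com",)


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the set of addresses the hostname resolves to (empty on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def is_verified_googlebot(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for a claimed GoogleBot IP."""
    host = reverse(ip)
    if host is None:
        return False
    host = host.rstrip(".").lower()
    # Step 1: the PTR record must sit under an accepted crawler domain.
    if not host.endswith(CRAWLER_SUFFIXES):
        return False
    # Step 2: the hostname must resolve back to the original IP. PTR records
    # are controlled by whoever owns the IP, so step 1 alone proves nothing.
    return ip in forward(host)
```

Note that the User-Agent header never enters the check; only the source IP matters, which is why the spoofed request above is rejected.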

matt-p 640 days ago

Now try it from a google cloud vm.

jsheard 640 days ago

Pretty sure that won't work. Google lets you validate whether an IP address is used by GoogleBot specifically, not just owned by Google in general. I doubt they are foolish enough to use the same pool of IP addresses for their internal crawlers and their public cloud.

https://developers.google.com/search/docs/crawling-indexing/...
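To illustrate the distinction between "used by GoogleBot" and "owned by Google", here is a rough sketch of a range check that only trusts a dedicated crawler pool. The CIDR blocks are hypothetical placeholders taken from the RFC 5737 documentation ranges, not Google's real ranges, and `in_ranges` is an illustrative helper name.

```python
import ipaddress

# Hypothetical placeholder ranges (RFC 5737 documentation blocks).
# A real check would load the crawler ranges the operator publishes.
CRAWLER_RANGES = ["192.0.2.0/24"]       # stands in for the crawler-only pool
CLOUD_RANGES = ["198.51.100.0/24"]      # stands in for the public cloud pool


def in_ranges(ip, cidrs):
    """Return True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


crawler_ip = "192.0.2.10"
cloud_vm_ip = "198.51.100.7"

# Only the crawler pool is trusted, so a cloud VM from the same owner fails.
print(in_ranges(crawler_ip, CRAWLER_RANGES))   # True
print(in_ranges(cloud_vm_ip, CRAWLER_RANGES))  # False
```

As long as the two pools are disjoint, running the spoofed request from a cloud VM gains nothing against a check of this kind.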

matt-p 640 days ago

It depends on how the site has implemented it; a huge number just look for AS origination and *googleusercontent.com
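A quick sketch of why that kind of loose check is weaker. The function names and hostnames are illustrative assumptions; the second hostname only mimics the shape of reverse DNS names typically seen on Google Cloud VMs.

```python
def loose_check(ptr_hostname):
    """Accept any reverse DNS name under googleusercontent.com."""
    host = ptr_hostname.rstrip(".").lower()
    return host.endswith(".googleusercontent.com")


def crawler_only_check(ptr_hostname):
    """Accept only reverse DNS names under googlebot.com."""
    host = ptr_hostname.rstrip(".").lower()
    return host.endswith(".googlebot.com")


# Illustrative hostnames, not real records.
crawler_host = "crawl-192-0-2-1.googlebot.com"
cloud_vm_host = "7.100.51.198.bc.googleusercontent.com"

print(loose_check(cloud_vm_host))         # True: the cloud VM passes
print(crawler_only_check(cloud_vm_host))  # False: the cloud VM is rejected
print(crawler_only_check(crawler_host))   # True
```

Either suffix test still needs a forward lookup confirming the hostname resolves back to the client IP, since the PTR record on its own is set by whoever controls the address.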
