| HN Mirror



	by zmarty 1951 days ago
	A lot of news websites block every crawler other than Google's, and they don't do it only via robots.txt.

1 comments

simias 1951 days ago

Indeed, years ago I had scripts to automatically fetch URLs from IRC, and I quickly realized that if I didn't spoof the user agent of a proper web browser, many websites would reject the request. Googlebot's UA worked just fine however.
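
For readers who haven't done this, here is a rough sketch of what sending a browser-style user agent looks like with Python's standard library. The `BROWSER_UA` string and the `build_request` / `fetch` helpers are made up for illustration, not taken from the scripts mentioned above.

```python
import urllib.request

# Hypothetical browser-like UA string, used here only as an example value.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that sends the given User-Agent instead of urllib's default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch a URL with the custom User-Agent and return the raw response body."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Without the explicit header, urllib announces itself as `Python-urllib/<version>`, which is the kind of default UA that sites tend to reject.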


judge2020 1951 days ago

> Googlebot's UA worked just fine however

They obviously don't care enough then - Google says you should use reverse DNS (rDNS) to verify that Googlebot crawls are real[0]. Cloudflare now does this automatically as well for customers with the WAF (Pro plan).

0: https://developers.google.com/search/docs/advanced/crawling/...
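
A minimal sketch of the reverse DNS check, assuming the usual two-step approach: look up the PTR record for the client IP, check that the hostname is under a Google crawler domain, then resolve that hostname forward and confirm it maps back to the same IP. The function name, the injectable lookup parameters, and the suffix list are my own choices for illustration; check Google's documentation for the current list of domains.

```python
import socket

# Assumption: domain suffixes for Googlebot hostnames; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Return True only if ip passes both the reverse and the forward DNS check."""
    try:
        # Step 1: reverse lookup (PTR record) for the connecting IP.
        hostname, _aliases, _addrs = reverse_lookup(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    hostname = hostname.rstrip(".").lower()
    # Step 2: the hostname must sit under one of the expected domains.
    # The leading dot in each suffix keeps "fakegooglebot.com" from matching.
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # Step 3: forward lookup; anyone can set a PTR record on their own IP
        # range, so the name must resolve back to the original address.
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:
        return False
    return ip in addresses
```

Note that `socket.gethostbyname_ex` only returns IPv4 addresses, so an IPv6 client would need `socket.getaddrinfo` instead. A server would call this only for requests whose User-Agent claims to be Googlebot, and cache the result per IP to avoid a DNS round trip on every request.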
