HN Mirror



	by creese 2043 days ago
	How would anyone block a crawler? A crawler is just a headless browser.

1 comments

tleb_ 2043 days ago

robots.txt

https://www.robotstxt.org/

https://en.wikipedia.org/wiki/Robots_exclusion_standard


Xylakant 2043 days ago

Note that robots.txt is only a hint to well-behaved crawlers; it does not block them in any way.

You can block crawlers if you can identify them, but reliably identifying them is hard.
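To illustrate why robots.txt is only advisory, here is a minimal sketch of the check a well-behaved crawler runs on its own side before fetching a page. The robots.txt content, the `ExampleBot` user agent, and the `example.com` URLs are made up for illustration; a crawler that skips this check is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler asks first and skips disallowed paths voluntarily.
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        print(url, "allowed" if allowed("ExampleBot", url) else "disallowed")
```

The server never sees this decision. Enforcement would have to happen separately, which is where the identification problem comes in.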


tleb_ 2042 days ago

We should probably classify the crawler-identification problem as impossible and move on. Fewer resources wasted and easier automation for everyone. Assuming a crawler is malicious is narrow-minded.


ddorian43 2041 days ago

https://developers.google.com/search/docs/advanced/verifying...


Xylakant 2041 days ago

This helps verify that a bot announcing itself as Googlebot really is Googlebot. It doesn’t help identify a bot that pretends to be a user/browser.
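For reference, a rough sketch of that kind of verification: a reverse DNS lookup on the client IP, a check of the resulting hostname's domain, then a forward lookup to confirm it maps back to the same IP. The function names and the suffix list (`googlebot.com`, `google.com`) are assumptions for illustration, so check the current Google documentation before relying on them. The resolvers are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Assumed domain suffixes for Google crawler hostnames; verify against
# Google's current documentation before using this for real.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Reverse DNS: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """Forward DNS: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Return True only if reverse and forward DNS agree on a Google hostname."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must map back to the original IP; otherwise
        # anyone controlling their own reverse DNS could fake the hostname.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

As noted, this only confirms or rejects a client that claims to be Googlebot in its User-Agent. A crawler that sends a normal browser User-Agent never triggers the check.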
