Hacker News
by Xylakant 2043 days ago
Note that robots.txt is only a hint to well-behaved crawlers; it does not block them in any way.

You can block crawlers if you can identify them, but reliably identifying them is hard.
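
To illustrate why robots.txt is only advisory, here is a minimal sketch of the check a polite crawler performs on its own side, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the URLs are hypothetical, and `polite_crawler_may_fetch` is a name made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def polite_crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt allow user_agent to fetch url.

    This check runs inside the crawler. The server never sees it, so a
    crawler that skips this call can still request any URL it likes.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page.html",
                "https://example.com/private/data.html"):
        allowed = polite_crawler_may_fetch(ROBOTS_TXT, "ExampleBot", url)
        print(url, "allowed" if allowed else "disallowed")
```

Nothing here is enforced by the server: the rule only has an effect because the crawler chooses to ask. Actual blocking has to happen server-side, which brings back the identification problem.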

1 comment

We should probably classify the crawler-identification problem as impossible and move on. Fewer resources would be wasted and automation would be easier for everyone. Assuming that a crawler is malicious is narrow-minded.
This helps verify that a bot announcing itself as Googlebot really is Googlebot. It doesn’t help identify a bot that pretends to be a user or browser.
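
The thread as shown does not say which technique "This" refers to. One commonly described way to confirm a self-declared Googlebot is a reverse DNS lookup on the client IP followed by a forward lookup on the returned hostname. The sketch below assumes that approach; the hostname suffix list and the function names are assumptions for illustration, not something taken from the comment.

```python
import socket
from typing import Callable, List

# Assumed hostname suffixes for genuine Googlebot hosts; check the operator's
# current documentation before relying on this list.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Resolve an IP address to its primary hostname (PTR record)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Reverse-then-forward DNS check for a client claiming to be Googlebot."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    if not host.endswith(GOOGLE_SUFFIXES):
        return False  # hostname is not under an expected domain
    try:
        # The forward lookup must map back to the same IP; otherwise anyone
        # controlling their own PTR record could claim a Google hostname.
        return ip in forward(host)
    except OSError:
        return False
```

The resolvers are passed in as parameters so the logic can be exercised without network access. As the comment says, this only covers bots that announce themselves: a crawler sending a browser user agent never triggers the check at all.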