by creese 2043 days ago
How would anyone block a crawler? A crawler is just a headless browser.

Note that robots.txt is only a hint to well-behaved crawlers; it does not block them in any way.
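
To make the "hint" point concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the ExampleBot name, and the example.com URLs are all made up for illustration. The check runs entirely on the crawler's side, so a crawler that never asks is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler asks before fetching and skips disallowed URLs.
private_allowed = is_allowed(
    ROBOTS_TXT, "ExampleBot", "https://example.com/private/data"
)
public_allowed = is_allowed(
    ROBOTS_TXT, "ExampleBot", "https://example.com/index.html"
)

# An impolite crawler simply never calls is_allowed(); the server
# cannot enforce the file, which is why it is only a hint.
```

Actual blocking has to happen on the server, which brings you back to the identification problem.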

You can block crawlers if you can identify them, but reliably identifying them is hard.

We should probably classify the crawler identification problem as impossible and move on. Fewer resources would be wasted, and automation would be easier for everyone. Assuming a crawler is malicious is narrow-minded.
This helps verify that a bot announcing itself as Googlebot really is Googlebot. It doesn’t help identify a bot that pretends to be a user or browser.
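
The comment does not say which check it means. One commonly documented way to verify a self-declared Googlebot is a forward-confirmed reverse DNS lookup, sketched below as an assumption rather than as the method referred to above. The function name, the injectable resolver parameters, and the suffix list are illustrative; check the suffixes against Google's current documentation before relying on them.

```python
import socket
from typing import Callable, List

# Assumed hostname suffixes for Google's crawlers; verify against
# Google's own documentation before using this for real decisions.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Forward DNS: hostname to its list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP; otherwise
        # anyone controlling their own reverse DNS could spoof it.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses.
        return False
```

The resolvers are parameters only so the logic can be exercised without network access. As the comment says, this only catches bots that claim to be Googlebot; a crawler sending a normal browser User-Agent never triggers the check.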