Hacker News
by chc 2546 days ago
This thread is about what behavior we should design crawlers to have. One person said crawlers should disregard noindex directives on government sites, and you replied that they should ignore all robots.txt directives and just crawl whatever they can. If you intentionally ignore robots.txt, that has intent, by definition.
1 comment

I don't mean intentionally ignoring it by going out of their way to override it, just that crawler authors shouldn't be required to implement that feature in their crawler. Apparently parsing those files is tricky, with plenty of edge cases. Ignoring that file is absolutely on the table. People can of course adhere to it, but it's not required, and in my opinion it shouldn't even be paid attention to.

In my younger years the only time I ever dealt with robots.txt was to find stuff I wasn't supposed to crawl.
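
For reference on the parsing point, here is a rough sketch of what checking a robots.txt looks like with the parser in Python's standard library (urllib.robotparser). The robots.txt content, the "example-crawler" user agent, and the example.com URLs are all made up for illustration, and this parser may not cover every edge case or nonstandard extension found in the wild.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-crawler" is a placeholder user agent, not a real crawler.
allowed_public = parser.can_fetch(
    "example-crawler", "https://example.com/index.html"
)
allowed_private = parser.can_fetch(
    "example-crawler", "https://example.com/private/report.html"
)

print(allowed_public, allowed_private)
```

Whether a crawler acts on the answer is the policy question being argued above; the sketch only shows the lookup itself.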