Hacker News
by user9383781 2529 days ago
I've always thought something like robots.txt was a bit silly when it's so easy to ignore.
5 comments

robots.txt is supposed to be helpful to the robot too.

If you write a crawler, you probably don't want it to waste time indexing a list of articles in every possible sort order, trying all "reply" buttons, things like that.

For me, a "Disallow" line in robots.txt means "don't bother, nothing interesting here". It is a suggestion that benefits everyone when followed, not an access control list.
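
A minimal sketch of what following a "Disallow" line looks like from the crawler's side, using Python's standard `urllib.robotparser`. The rules, the `example-crawler` user agent, and the example.com URLs are made up for illustration; a real crawler would load the site's own robots.txt rather than hard-coding one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not copied from any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /reply
Disallow: /articles/sort/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def worth_fetching(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules do not ask this user agent to skip the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS_TXT)
    for url in (
        "https://example.com/articles/42",
        "https://example.com/articles/sort/by-date",
        "https://example.com/reply/123",
    ):
        verdict = "fetch" if worth_fetching(parser, "example-crawler", url) else "skip"
        print(verdict, url)
```

Nothing here stops the crawler from fetching the skipped URLs anyway; the check is purely voluntary, which is the point being made above.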

>If you write a crawler, you probably don't want it to waste time indexing a list of articles in every possible sort order, trying all "reply" buttons, things like that.

On the other hand, many websites (like Wikipedia here) hide interesting pages behind a Disallow.

I think the concern is, more accurately, that you must go out of your way to honor robots.txt.

I think both robots.txt and security.txt are great ideas. However, they will only ever be useful to those who follow the wishes of the website (who hopefully outweigh those who do not).

Google, Bing, and other large search engines honor it. That alone is the point of most of the entries in this file.

robots.txt is the sheet with the house rules on the wall (or part thereof), not the enforcement of those rules.

Part of my monthly maintenance on an independent MediaWiki install is to cross-reference our robots.txt (which is based on Wikipedia's) and server logs.

If a client or IP range is misbehaving in the server logs, it goes into robots.txt. If it's ignoring robots.txt, it gets added to the firewall's deny list.

I've tried to automate that process a few times but have never gotten far. It's unending, though. It feels like all it takes is a handful of cash and a few days to start an SEO marketing buzzword company with its own crawler, all to build yet another thing for us to block.
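
One rough way the second half of that triage (spotting clients that ignore robots.txt) could be sketched in Python. Everything here is an assumption for illustration: the log entries are taken to be already parsed into `(ip, user_agent, path)` tuples, `find_violators` and its `threshold` parameter are hypothetical names, and the function only reports candidate IPs; it does not touch any firewall.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser


def find_violators(log_entries, robots_txt, threshold=3):
    """List IPs that requested disallowed paths at least `threshold` times.

    log_entries: iterable of (ip, user_agent, path) tuples (assumed format).
    robots_txt:  the text of the site's robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    hits = Counter()
    for ip, user_agent, path in log_entries:
        # can_fetch() is False when the rules ask this agent to stay away.
        if not parser.can_fetch(user_agent, path):
            hits[ip] += 1

    # Candidates for the deny list; a human should still review them.
    return sorted(ip for ip, count in hits.items() if count >= threshold)
```

The threshold is there so that a single stray request does not land a client on the deny list; what counts as "misbehaving" is a judgment call that this sketch does not try to settle.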

It's like putting up this sign in a public restroom (slightly nsfw):

http://i.imgur.com/d9L6I59.jpg

It's a bit like "do not track". The people you want to use it for don't care and won't respect it (even if publicly they might say they will).