by the8472 2529 days ago
robots.txt is the sheet with the house rules on the wall (or part thereof), not the enforcement of those rules.

Part of my monthly maintenance on an independent MediaWiki install is to cross-reference our robots.txt (which is based on Wikipedia's) against the server logs.

If a client or IP range is misbehaving in the server logs, it goes into robots.txt. If it's ignoring robots.txt, it gets added to the firewall's deny list.
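For what it's worth, the "is it ignoring robots.txt" half of that check can be roughed out with Python's standard `urllib.robotparser`. This is only a sketch under assumptions that are mine, not part of the setup described above: the access log is in Apache/nginx combined format, and `find_violators`, `LOG_LINE`, and `threshold` are hypothetical names. One caveat: `robotparser` matches on the part of the user-agent string before the first `/`, so browser-style strings like `Mozilla/5.0 (compatible; SomeBot/1.0)` would need extra handling.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumed combined log format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def find_violators(robots_txt, log_lines, threshold=1):
    """Count requests that robots.txt disallows, grouped by (ip, user-agent).

    Returns only the clients with at least `threshold` disallowed requests;
    those are the candidates for the firewall deny list.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    hits = Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that don't look like combined-format entries
        if not parser.can_fetch(m["agent"], m["path"]):
            hits[(m["ip"], m["agent"])] += 1

    return {client: n for client, n in hits.items() if n >= threshold}
```

The result still wants a human looking at it before anything goes into the firewall; a higher `threshold` just cuts down on one-off noise.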

I've tried to automate that process a few times but never got far. It's unending, though. All it seems to take is a handful of cash and a few days to start an SEO marketing buzzword company with its own crawler, which gives us yet another thing to block.

It's like putting up this sign in a public restroom (slightly nsfw):

http://i.imgur.com/d9L6I59.jpg