Part of my monthly maintenance on an independent MediaWiki install is to cross-reference our robots.txt (which is based on Wikipedia's) against the server logs.
If a client or IP range is misbehaving in the server logs, it goes into robots.txt. If it's ignoring robots.txt, it gets added to the firewall's deny list.
I've tried to automate that process a few times but have never gotten far. It's unending, though. It feels like all it takes is a handful of cash and a few days to start an SEO marketing buzzword company with its own crawler, which just gives us yet another thing to block.
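
For anyone who wants to try automating the first pass, here is a rough sketch of that triage in Python. It is not what I actually run. It assumes the access log is in the common "combined" format, and the function name `triage` and the `max_hits` threshold are made up for illustration. Clients that request paths robots.txt disallows become deny-list candidates, and clients that are merely noisy become candidates for a new robots.txt entry.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumed "combined" log format:
# ip - user [time] "METHOD path proto" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def triage(log_lines, robots_txt, max_hits=1000):
    """Return (robots_candidates, deny_candidates) as sorted (ip, agent) lists.

    max_hits is a hypothetical threshold for "misbehaving"; tune it per site.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    hits = Counter()        # total requests per (ip, agent)
    violations = Counter()  # requests for paths robots.txt disallows

    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        client = (match["ip"], match["agent"])
        hits[client] += 1
        if not rules.can_fetch(match["agent"], match["path"]):
            violations[client] += 1

    # Ignoring robots.txt: candidates for the firewall's deny list.
    deny_candidates = sorted(violations)
    # Noisy but not yet breaking the rules: candidates for robots.txt.
    robots_candidates = sorted(
        client for client, count in hits.items()
        if count > max_hits and client not in violations
    )
    return robots_candidates, deny_candidates
```

You would call it with something like `triage(open("access.log"), open("robots.txt").read())` and review both lists by hand before changing anything. One caveat: the standard library parser's user-agent matching is fairly simplistic, so rules aimed at a specific bot may not match the full user-agent string a crawler sends.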