| This is kinda amusing. robots.txt main purpose back in the day was curtailing penalties in the search engines when you got stuck maintaining a badly-built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs, ignore all the other ones with query parameters or whatever that give almost the same result." It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites. Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them. That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for. Or the Evil Bit proposal, to suggest that malware should identify itself in the headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest." |