by luckylion 1913 days ago
I usually recommend setting only Google/Bing/Yandex/Baidu etc. to Allow and everything else to Disallow.

Yes, the bad bots don't give a fuck, but even the non-malicious bots (ahrefs, moz, some university's search engine etc.) don't bring any value to me as a site owner; they take up bandwidth and resources and fill up logs. If you can remove them with three lines in your robots.txt, that's less noise. Universities especially, in my opinion, often behave badly and are uncooperative when you point out that their throttling does not work and they're hammering your server. Giving them a "Go Away, You Are Not Wanted Here" in a robots.txt works for most, and the rest just get blocked.
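
For illustration, here is a rough sketch of what such an allowlist robots.txt might look like, checked with Python's standard urllib.robotparser. The user-agent tokens (Googlebot, Bingbot, Yandex, Baiduspider), the example.com URL, and the is_allowed helper are assumptions made for this example, not something stated in the comment above; check each search engine's documentation for the tokens it actually honours.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist: the named search engines may crawl everything,
# every other crawler is asked to stay away. Tokens are assumptions.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
User-agent: Yandex
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/page") -> bool:
    """Return True if ROBOTS_TXT lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Allowlisted crawlers match their own group; everyone else falls
# through to the catch-all "*" group.
googlebot_ok = is_allowed("Googlebot")   # True
ahrefs_ok = is_allowed("AhrefsBot")      # False
```

This only helps with crawlers that read and respect robots.txt; anything that ignores it still has to be blocked some other way.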


> they're hammering your server

Why can't you just rate-limit IPs that are "too active" for your server to handle?
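
As a minimal sketch of what per-IP rate limiting could look like, here is a small token bucket keyed by client address. The class name PerIPRateLimiter, the rate and burst parameters, and the sample addresses are all hypothetical; a real deployment would more likely use the rate limiting built into the web server or a proxy in front of it.

```python
import time
from typing import Dict, Optional, Tuple


class PerIPRateLimiter:
    """Token bucket per client IP: `rate` tokens/second, up to `burst` stored."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        # ip -> (tokens remaining, timestamp of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True if a request from `ip` may proceed at time `now`."""
        if now is None:
            now = time.monotonic()
        # A new IP starts with a full bucket.
        tokens, last = self._buckets.get(ip, (float(self.burst), now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[ip] = (tokens - 1.0, now)
            return True
        self._buckets[ip] = (tokens, now)
        return False


limiter = PerIPRateLimiter(rate=1.0, burst=3)
# Five requests at the same instant: the burst of 3 passes, the rest are refused.
burst_results = [limiter.allow("203.0.113.7", now=0.0) for _ in range(5)]
# Two seconds later the bucket has refilled enough for another request.
later_result = limiter.allow("203.0.113.7", now=2.0)
```

Because the bucket is keyed by address, every client that shares one public IP also shares one bucket.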

From some I could, but why would I? If they're not adding value and they don't want to behave, I don't see a reason to spend money to adapt my systems to be "inclusive" towards their usage patterns.

In context, you're justifying blocking all automated traffic, even that which does behave, by pointing out that some of it doesn't. That attitude seems lazy at best, malicious at worst.

CGNAT

Now that's a really good point. I wonder why there isn't a standard protocol for signalling upstream that a particular connection is abusive and to please rate limit the path at the source on your behalf? It would certainly add complexity, but the current situation is hardly better.