Hacker News new | ask | show | jobs
by redox99 11 days ago
You get downvoted for these opinions but I agree. Most people that complain that their servers get hammered by AI bots are those that run very unoptimized servers that can only handle like 100 rps. I've never had any issues with any of my moderately optimized websites. A $10 VPS can handle sooo much traffic.
2 comments

I think people get annoyed when it's suggested they spend time optimising or even re-writing their websites to handle high traffic loads just to cater to AI bots ripping their content.

It's also not always easy to do. I run a small wiki which is fairly optimised, nearly every page manages at least ~3k rps on a small VPS. The only exception is the diff page which is ~150 rps. Optimising that while still giving good output isn't that easy, but the wiki doesn't have many users so that would be fine if it wasn't for the AI bots.

The AI bots ignore robots.txt and were initially hitting the site with ~1k rps crawling every combination. Even that would be manageable as there's currently ~150,000 combinations, except they kept re-crawling the whole lot each day. The server could manage it but it was a massive waste of resources.

They were using residential IPs and only sending 1 request from each IP making it impossible to block. In the end I gave up and put a Cloudflare challenge in front of it. I don't want to use Cloudflare but the alternative is forcing users to login to view diffs or remove them entirely.

What I do is have more strict rate limits for non logged in users. You tell them to log in if they hit the rate limit. For non logged in users, you have a rate limit not just for IP, but also for /24 and /16. Forget about IPv6, IPv4 scarcity is a feature not a bug.
The bot I had was using unique IPs for each request. Some were from cloud providers but most were just random residential ISPs. I couldn't see any obvious connections so rate limiting would've had to be a global rate limit.

Similar to the one SQLite had: https://www2.sqlite.org/forum/forumpost/7d3eb059f81ff694?t=h

Each IP only makes ~1 request though so easy to detect after the fact.

I guess they will run out of IPs at some point so maybe if I had logged each one forever and shown a challenge only to them, it would have fixed it eventually. Just depends how big their pool of IPs is.

You were getting 1k rps, and each request was from an unique IP? So after an hour you got hit by 3.6M different IPs? And all from uncorrelated /16s? That seems hard to believe. Not that I don't believe you, it's just hard for me to grasp that whoever was scraping you had such a large and distributed swarm.
This is called rotating residential proxy service. You can buy it off grey market sites that are probably getting it from botnet operators. It costs about $2-$5 per GB.
Interesting, that definitely seems to be it.
Curious, but how do the bots figure out the combinations? Or do you have links to the diffs from other sites? I assume the diff takes two files in query parameters or something.
I'm not 100% sure but I think links. There's a bunch on the history and revision pages. Yeah, the diff URL has two revision ID's as parameters.

I did try removing some of the links without success. I guess once they have them they just keep checking.

There really isn't a good reason for a wiki (or git host) to provide diffs between arbitrary revisions to unauthenticated users. Limit it to diffs compared to previous (which can be cached) and this problem goes away.

In any case, such labyrinths of expensive dynamically generated pages are no excuse for subjecting people requesting the start page to bot checks.

I see many mediawiki wikis (like the Arch Linux wiki) using anubis succsefully. It can be configured to only act on certain paths.
I managed to solve my scraper problems without optimizing much, but if I had to optimize I think the only option might be "don't use mediawiki" and that's an extremely obnoxious solution. Though maybe I could get there by throttling specific kinds of pages.