| HN Mirror


	by redox99 11 days ago
	You get downvoted for these opinions, but I agree. Most people who complain that their servers get hammered by AI bots run very unoptimized servers that can only handle around 100 rps. I've never had any issues with any of my moderately optimized websites. A $10 VPS can handle so much traffic.

CodeBytes 11 days ago

I think people get annoyed when it's suggested they spend time optimising or even re-writing their websites to handle high traffic loads just to cater to AI bots ripping their content.

It's also not always easy to do. I run a small wiki which is fairly optimised: nearly every page manages ~3k rps or more on a small VPS. The only exception is the diff page, which manages ~150 rps. Optimising that while still giving good output isn't easy, but the wiki doesn't have many users, so that would be fine if it weren't for the AI bots.

The AI bots ignore robots.txt and were initially hitting the site with ~1k rps, crawling every combination. Even that would be manageable, as there are currently ~150,000 combinations, except they kept re-crawling the whole lot each day. The server could manage it, but it was a massive waste of resources.

They were using residential IPs and sending only 1 request from each IP, making them impossible to block by IP. In the end I gave up and put a Cloudflare challenge in front of it. I don't want to use Cloudflare, but the alternative is forcing users to log in to view diffs or removing diffs entirely.

redox99 11 days ago

What I do is have stricter rate limits for non-logged-in users. You tell them to log in if they hit the rate limit. For non-logged-in users, you have a rate limit not just per IP, but also per /24 and /16. Forget about IPv6; IPv4 scarcity is a feature, not a bug.
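
A minimal sketch of tiered limits like these, assuming a sliding window held in memory. The class name, the per-prefix limits, and the 60-second window are all hypothetical values chosen for illustration, not anything stated in the thread, and the sketch handles IPv4 only.

```python
import ipaddress
import time
from collections import defaultdict, deque

# Hypothetical limits: max requests per window for each prefix length.
DEFAULT_LIMITS = {32: 30, 24: 120, 16: 600}
WINDOW_SECONDS = 60.0


class PrefixRateLimiter:
    """Sliding-window limiter that counts a request against its IP, /24 and /16."""

    def __init__(self, limits=None, window=WINDOW_SECONDS):
        self.limits = dict(DEFAULT_LIMITS if limits is None else limits)
        self.window = window
        self.hits = defaultdict(deque)  # network string -> request timestamps

    def _keys(self, ip):
        # Map the address to the enclosing network for each prefix length.
        for prefix in self.limits:
            network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
            yield prefix, str(network)

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        keys = list(self._keys(ip))
        # Drop timestamps that have fallen out of the window.
        for _, key in keys:
            queue = self.hits[key]
            while queue and now - queue[0] >= self.window:
                queue.popleft()
        # Reject if any granularity is already at its limit.
        if any(len(self.hits[key]) >= self.limits[prefix] for prefix, key in keys):
            return False
        # Otherwise count the request at every granularity.
        for _, key in keys:
            self.hits[key].append(now)
        return True
```

A request handler would call `allow(client_ip)` for non-logged-in users only and show a "please log in" page when it returns `False`. Rejected requests are not counted, so a blocked client does not extend its own block.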

CodeBytes 11 days ago

The bot I had was using unique IPs for each request. Some were from cloud providers but most were just random residential ISPs. I couldn't see any obvious connections so rate limiting would've had to be a global rate limit.

Similar to the one SQLite had: https://www2.sqlite.org/forum/forumpost/7d3eb059f81ff694?t=h

Each IP only makes ~1 request though, so it's easy to detect after the fact.

I guess they will run out of IPs at some point so maybe if I had logged each one forever and shown a challenge only to them, it would have fixed it eventually. Just depends how big their pool of IPs is.
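
A rough sketch of that "log each one and challenge only them" idea. It assumes the bot signature is an IP whose entire history is a single request to an expensive page; the `/diff` prefix and all names here are hypothetical, and a real deployment would need persistent storage rather than an in-memory dict.

```python
from collections import defaultdict


class SeenIpChallenge:
    """Flags IPs after the fact and challenges them if they ever return."""

    def __init__(self, expensive_prefix="/diff"):
        self.expensive_prefix = expensive_prefix  # hypothetical diff URL prefix
        self.history = defaultdict(list)  # ip -> paths requested so far
        self.flagged = set()  # never expires: "logged forever"

    def record(self, ip, path):
        self.history[ip].append(path)

    def flag_single_shot_ips(self):
        # Run periodically: an IP whose only request was an expensive page
        # matches the one-request-per-IP pattern described above.
        for ip, paths in self.history.items():
            if len(paths) == 1 and paths[0].startswith(self.expensive_prefix):
                self.flagged.add(ip)
        return len(self.flagged)

    def needs_challenge(self, ip):
        return ip in self.flagged
```

Whether this helps depends on how large the proxy pool is, since a first-time IP always gets through unchallenged.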

redox99 10 days ago

You were getting 1k rps, and each request was from a unique IP? So after an hour you got hit by 3.6M different IPs? And all from uncorrelated /16s? That seems hard to believe. Not that I don't believe you, it's just hard for me to grasp that whoever was scraping you had such a large and distributed swarm.
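
The arithmetic behind that figure, as a quick sketch. It assumes the ~1k rps rate was sustained for a full hour with one request per IP, which the thread does not confirm; the second result shows how long a single pass over the ~150,000 diff combinations would take at the same rate.

```python
REQUESTS_PER_SECOND = 1_000   # "~1k rps" from the thread
SECONDS_PER_HOUR = 60 * 60
DIFF_COMBINATIONS = 150_000   # "~150,000 combinations" from the thread

# One request per IP means unique IPs per hour equals requests per hour.
unique_ips_per_hour = REQUESTS_PER_SECOND * SECONDS_PER_HOUR

# Time for one full pass over every diff combination at that rate.
seconds_per_full_crawl = DIFF_COMBINATIONS / REQUESTS_PER_SECOND

print(f"unique IPs per hour: {unique_ips_per_hour:,}")
print(f"seconds per full crawl: {seconds_per_full_crawl:.0f}")
```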

tardedmeme 10 days ago

This is called a rotating residential proxy service. You can buy it from grey-market sites that are probably getting it from botnet operators. It costs about $2-$5 per GB.

redox99 10 days ago

Interesting, that definitely seems to be it.

canyp 11 days ago

Curious, but how do the bots figure out the combinations? Or do you have links to the diffs from other sites? I assume the diff takes two files in query parameters or something.

CodeBytes 11 days ago

I'm not 100% sure, but I think links. There's a bunch on the history and revision pages. Yeah, the diff URL has two revision IDs as parameters.

I did try removing some of the links without success. I guess once they have them they just keep checking.

account42 10 days ago

There really isn't a good reason for a wiki (or git host) to provide diffs between arbitrary revisions to unauthenticated users. Limit it to diffs compared to previous (which can be cached) and this problem goes away.
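
A minimal sketch of that restriction, assuming revisions of a page have consecutive integer IDs and live in an in-memory dict. `REVISIONS`, `diff_to_previous`, and `handle_diff_request` are hypothetical names; a real wiki would read from its database and cache at the HTTP layer.

```python
import difflib
from functools import lru_cache

# Hypothetical revision store: revision ID -> page text.
REVISIONS = {
    1: "hello\nworld\n",
    2: "hello\nwiki\n",
    3: "hello\nwiki\nbots\n",
}


def _unified_diff(old_id, new_id):
    old = REVISIONS[old_id].splitlines(keepends=True)
    new = REVISIONS[new_id].splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old, new, fromfile=f"r{old_id}", tofile=f"r{new_id}")
    )


@lru_cache(maxsize=1024)
def diff_to_previous(rev_id):
    """Diff against the immediately preceding revision; one cache entry per revision."""
    return _unified_diff(rev_id - 1, rev_id)


def handle_diff_request(old_id, new_id, authenticated):
    """Return (status, body). Anonymous users only get adjacent, cached diffs."""
    if old_id not in REVISIONS or new_id not in REVISIONS:
        return 404, "Unknown revision"
    if new_id - old_id == 1:
        return 200, diff_to_previous(new_id)
    if not authenticated:
        return 403, "Log in to compare arbitrary revisions"
    return 200, _unified_diff(old_id, new_id)
```

With this shape the number of cacheable diffs grows linearly with the number of revisions instead of with every pair of revisions.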

In any case, such labyrinths of expensive dynamically generated pages are no excuse for subjecting people requesting the start page to bot checks.

Velocifyer 10 days ago

I see many MediaWiki wikis (like the Arch Linux wiki) using Anubis successfully. It can be configured to only act on certain paths.

Dylan16807 11 days ago

I managed to solve my scraper problems without optimizing much, but if I had to optimize, I think the only option might be "don't use MediaWiki", and that's an extremely obnoxious solution. Though maybe I could get there by throttling specific kinds of pages.
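
One way throttling by page kind could look, sketched as a global token bucket per URL prefix. This is framework-agnostic and not a MediaWiki feature; the prefixes, rates, and function names are hypothetical.

```python
import time


class TokenBucket:
    """Global bucket: refills at `rate` tokens per second up to `burst`."""

    def __init__(self, rate, burst, now=None):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def try_acquire(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical page classes: only the expensive ones get a bucket.
BUCKETS = {
    "/diff": TokenBucket(rate=50, burst=100),
    "/history": TokenBucket(rate=200, burst=400),
}


def allow_request(path, buckets=BUCKETS, now=None):
    """Cheap pages always pass; expensive page kinds share a global budget."""
    for prefix, bucket in buckets.items():
        if path.startswith(prefix):
            return bucket.try_acquire(now)
    return True
```

Because the budget is global rather than per IP, it still caps load when every request comes from a different address, at the cost of occasionally rejecting real users on the throttled pages.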
