| Wikipedia says their traffic increased roughly 50% [1] from AI bots, which is a lot, sure, but nowhere near the amount where you'd have to rearchitect your site or something. And this checks out: if it were actually debilitating, you'd notice Wikipedia's performance degrade. It hasn't. You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case. More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such coarse estimation, because according to Cloudflare, the total increase from combined search and AI bots in the last year (May 2024 - May 2025) has just been... 18% [2].
The way you hear people talk about it though, you'd think that servers are now receiving DDoS-levels of traffic or something. For the life of me I have not been able to find a single verifiable case of this. Which, if you think about it, makes sense... It's hard to generate that sort of traffic; that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". So the only other possible explanation would be a large number of scrapers simultaneously but independently hitting sites. But this also doesn't check out. There aren't thousands of different AI scrapers out there that in aggregate are resulting in huge traffic spikes [2]. Again, the total combined increase is 18%.
The more you look into this accepted idea that we are in some sort of AI scraping traffic apocalypse, the less anything makes sense. You then look at this Anubis "AI scraping mitigator" and... I dunno. The author contends that one of its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3].
No one is rolling their own JS engine for a scraper such that they are being blocked by their inability to keep up with the latest ECMAScript spec. You are just using an existing JS engine, all of which support all these features. It would actually be a challenge to find an old JS engine these days. The entire thing seems to be built on the misconception that the "common" way to build a scraper is doing something curl-esque. This idea is entirely based on the Google scraper, which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone who rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down", because so many pages, even blogs, are just entirely client-side rendered SPAs.
If I were to write a quick and dirty scraper today I would trivially make it through Anubis' protections... by doing literally nothing and without even realizing Anubis exists. Just using standard scraping practices with Puppeteer. Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.
I'm investigating further. I think this entire thing may have started due to some confusion, but I want to see if I can actually confirm this before speculating further.
1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)
2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...
3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...
4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-... |
I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScript to display the text being searched. (This is not the same as excluding everything that has JavaScript; some web pages use JavaScript but can still display the text even without it.)
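To make the distinction concrete, here is a rough sketch in Python of the kind of check I mean: index only the text present in the HTML as served, without executing any scripts. The names (`StaticTextExtractor`, `static_text`, `matches_without_js`) are made up for illustration, and this is a simplification, not how any real search engine works. It ignores CSS-hidden content, and joining text fragments with spaces can split a word that spans inline tags.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects text that is present in the markup without running scripts."""

    # Contents of these elements are not rendered as page text.
    # <noscript> is deliberately not listed: it is shown when scripts are off.
    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # how many skipped elements we are currently inside
        self.chunks = []  # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def static_text(html):
    """Return the whitespace-normalized text available without JavaScript."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def matches_without_js(html, query):
    """True if the query appears in the text that needs no script to display."""
    return query.casefold() in static_text(html).casefold()
```

A page that uses scripts but still ships its text in the HTML would match, while a client-side rendered page whose text only exists inside a script would not.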
> Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.
These are some of the legitimate problems with Anubis (and this is not the only way that you can be blocked by Anubis). Cloudflare can have similar problems, although it works a bit differently, so the problems are not exactly the same.