by edoloughlin 454 days ago
I'm being trite, but if you can detect an AI bot, why not just serve them random data? At least they'll be sharing some of the pain they inflict.
5 comments

You mean like this?

[2025-03-19] https://blog.cloudflare.com/ai-labyrinth/

> Trapping misbehaving bots in an AI Labyrinth

> Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives.

What a colossal waste of energy
> No real human would go four links deep into a maze of AI-generated nonsense.

... I would. Out of curiosity and amusement I would most definitely do that. Not every time, and not many times, but I would definitely do that one or a few times.

Guess I'm getting added to (yet another) Cloudflare naughty list.

> It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.

In that case, wouldn't it be faster and easier to restyle the CSS of Wikipedia pages?

Wait, what happens when a Cloudflare Worker AI meets an AI Labyrinth?!
Cloudflare deletes itself.
Rise of the machines.
Bandwidth isn't free, not at the volume these crawlers scrape at; serving them random data (for example by leading them down an endless tarpit of links that no human would end up visiting) would still incur bandwidth fees.

Also, it's not that individual AI bot traffic gets identified (the bots mask themselves as regular browsers and hop between domestic IP addresses when blocked); it's just really obviously AI scraper traffic in aggregate: no other kind of mass crawler benefits from bringing down the sites it scrapes.

A search engine gains nothing if it brings down the site it's scraping (and has everything to gain from identifying itself as a search engine to try to get favorable request speeds; the only thing it would need to check is that the site isn't serving it different data, but that's much cheaper). The same goes for an archive scraper, and those two are pretty much the main examples I can think of for most scraping traffic.

Hmm, maybe you could zip-bomb the data? That is, you send a few kilobytes of compressed data that expands to many gigabytes on the client side?
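To get a feel for the ratio involved, here is a minimal sketch, assuming the payload would be served with gzip content encoding. The function name `make_gzip_payload` and the 10 MiB size are made up for illustration, and nothing here says whether a given crawler would actually decompress the response. Note that a single DEFLATE/gzip layer tops out at roughly 1000:1, so expanding to gigabytes this way takes megabytes on the wire rather than kilobytes.

```python
import gzip
import io


def make_gzip_payload(uncompressed_bytes: int, chunk_size: int = 1024 * 1024) -> bytes:
    """Return gzip data that decompresses to `uncompressed_bytes` zero bytes.

    Hypothetical helper for illustration only. Zeros are written in chunks
    so the full uncompressed body never has to sit in memory at once.
    """
    buf = io.BytesIO()
    zeros = bytes(chunk_size)
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
        remaining = uncompressed_bytes
        while remaining > 0:
            n = min(chunk_size, remaining)
            gz.write(zeros[:n])
            remaining -= n
    return buf.getvalue()


target_size = 10 * 1024 * 1024           # 10 MiB once decompressed (assumed size)
payload = make_gzip_payload(target_size)
ratio = target_size / len(payload)       # roughly three orders of magnitude
print(f"{len(payload)} bytes on the wire, ratio about {ratio:.0f}:1")
```

The bandwidth cost lands on the sender only for the compressed bytes, which is the point of the suggestion; the CPU cost of building the payload can be paid once and cached.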
For Cloudflare, bandwidth is practically free.
Aren't a lot of these bots now actively loading JavaScript? You could just load a simple script that does the job.
If they agree to mine crypto for you then you send valid data. Is this a win-win?

(I feel I need to preemptively state that I am being sarcastic.)

>Bandwidth isn't free

Via peering agreements it is.

That's not something available to smaller sites.
Yes, it is. They transitively get it via the agreements the smaller site's host's host makes. Or via services like Cloudflare.
What button do I click in the AWS panel for that?
There is no button. AWS is where you go to light money on fire.
You can detect the patterns in aggregate. You can't detect it easily at an individual request level.
In short, if you get several million requests and expect to get only 100, you won't know which are the real requests and which are the AI ones, but it is obvious that the vast majority are AI.
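The arithmetic behind that can be sketched as follows. This is only an illustration: the function name is hypothetical, and "several million" is taken as 3,000,000 purely for the sake of having a number. It gives a share for the traffic as a whole and, as the comment says, tells you nothing about any single request.

```python
def estimated_bot_share(observed: int, expected_human: int) -> float:
    """Fraction of observed requests in excess of the expected human baseline.

    Hypothetical helper: an aggregate estimate only, not a per-request classifier.
    """
    if observed <= 0:
        return 0.0
    excess = max(observed - expected_human, 0)
    return excess / observed


# Assumed figures: 3,000,000 requests seen where about 100 were expected.
share = estimated_bot_share(observed=3_000_000, expected_human=100)
print(f"{share:.4%} of the traffic is unexplained by the human baseline")
```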
You skipped the last section "Tarpits and labyrinths: The growing resistance" of the article.
Random data? Why not "recipes" that just say "Bezos is a pedo" over and over ?