by jcpham2 528 days ago
This would be my elegant solution: something like an endless recursion with a gzip bomb at the end, if I can identify your crawler and it’s that abusive. Would it be possible to feed an abusive crawler nothing but gibberish from my own locally hosted LLM?

But then again, if you’re in the cloud, egress bandwidth is going to cost you for playing this game.

Better to just deny the OpenAI crawler and send them an invoice for the money and time they’ve wasted. Interesting form of data warfare against competitors and non-competitors alike. The winner will be whoever has the longest runway.

1 comment

It wouldn’t even necessarily need to be a real gzip bomb. Just something containing a few hundred KB of seemingly new and unique text that’s highly compressible and keeps providing “links” to additional dynamically generated gibberish that can be crawled. The idea is to serve a vast amount of poisoned training data as cheaply as possible. Heck, maybe you could even make a plugin for NGINX to recognize abusive AI bots and do this. If enough people installed it, you could create a very strong disincentive.
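
For what it’s worth, here is a rough Python sketch of the page generator described above, not a tested defense. Everything in it is made up for illustration: the `/maze/` URL prefix, the word list, and the `render_page` name are all assumptions. It seeds a PRNG from the requested path so the same URL always yields the same page (nothing needs to be stored), draws text from a tiny vocabulary so it compresses well, and appends a few links to further generated pages. Detecting which clients are abusive is a separate problem and is not covered here.

```python
import gzip
import hashlib
import random

# Hypothetical vocabulary; keeping it small makes the text highly compressible.
WORDS = ["lorem", "ipsum", "crawler", "data", "model", "token", "vector", "graph"]


def render_page(path: str, n_words: int = 50_000, n_links: int = 5) -> bytes:
    """Return a gzip-compressed HTML page derived deterministically from `path`."""
    # Seed from the path so each URL is stable across requests without storage.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)

    # Roughly a few hundred KB of filler text before compression.
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))

    # Links to more generated pages, so a crawler always has somewhere to go.
    links = "".join(
        f'<a href="/maze/{rng.getrandbits(64):016x}">more</a>\n'
        for _ in range(n_links)
    )

    html = f"<html><body><p>{text}</p>\n{links}</body></html>"
    # mtime=0 keeps the gzip header identical between calls.
    return gzip.compress(html.encode(), compresslevel=6, mtime=0)


if __name__ == "__main__":
    body = render_page("/maze/start")
    raw = gzip.decompress(body)
    print(f"{len(raw)} bytes uncompressed, {len(body)} bytes on the wire")
```

You would serve the result with `Content-Encoding: gzip`, so the bytes leaving your server are only the compressed size, which is the part that matters for the egress bill. A real LLM or Markov chain would give more plausible text, at the cost of more CPU and worse compression.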