This would be my elegant solution, something like an endless recursion with a gzip bomb at the end if I can identify your crawler and it’s that abusive. Would it be possible to feed an abusing crawler nothing but my own locally-hosted LLM gibberish?
But then again if you’re in the cloud egress bandwidth is going to cost for playing this game.
Better to just deny the OpenAI crawler and send them an invoice for the money and time they’ve wasted. Interesting form of data warfare against competitors and non competitors alike. The winner will have the longest runway
It wouldn’t even necessarily need to be a real GZip bomb. Just something containing a few hundred kb of seemingly new and unique text that’s highly compressible and keeps providing “links” to additional dynamically generated gibberish that can be crawled. The idea is to serve a vast amount of poisoned training data as cheaply as possible. Heck, maybe you could even make a plugin for NGINX to recognize abusive AI bots and do this. If enough people install it then you could provide some very strong disincentives.
An easy way is to give the model the URL of the page so it can value the content based on the reputation of the source, of course the model doesn't know future events, but gibberish is gibberish, and that's quite easy to filter, even without knowing the source.
couple of lines in your nginx/apache config and off you go
my content rich sites provide this "high quality" data to the parasites