HN Mirror



	by exe34 530 days ago
	can you feed them gibberish?

4 comments

blibble 530 days ago

here's a nice project to automate this: https://marcusb.org/hacks/quixotic.html

couple of lines in your nginx/apache config and off you go

my content-rich sites provide this "high quality" data to the parasites

Groxx 530 days ago

LLMs poisoned by https://git-man-page-generator.lokaltog.net/ -like content would be a hilarious end result, please do!

jcpham2 530 days ago

This would be my elegant solution: something like an endless recursion with a gzip bomb at the end, if I can identify your crawler and it’s that abusive. Would it be possible to feed an abusive crawler nothing but my own locally hosted LLM gibberish?

But then again, if you’re in the cloud, egress bandwidth is going to cost you for playing this game.

Better to just deny the OpenAI crawler and send them an invoice for the money and time they’ve wasted. Interesting form of data warfare against competitors and non-competitors alike. The winner will have the longest runway.

actsasbuffoon 530 days ago

It wouldn’t even necessarily need to be a real gzip bomb. Just something containing a few hundred KB of seemingly new and unique text that’s highly compressible and keeps providing “links” to additional dynamically generated gibberish that can be crawled. The idea is to serve a vast amount of poisoned training data as cheaply as possible. Heck, maybe you could even make a plugin for NGINX to recognize abusive AI bots and do this. If enough people install it, then you could provide some very strong disincentives.
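
A rough sketch of that idea in Python, for illustration only. It assumes a hypothetical `/maze/` URL prefix, a made-up word list, and made-up function names (`gibberish_page`, `compress_page`); none of these come from the thread or from any existing tool. Each page is generated from a hash of its own path, so nothing needs to be stored, and the small vocabulary is what makes the output compress well.

```python
import gzip
import hashlib
import random

# Hypothetical vocabulary: a small word list keeps the text highly compressible.
WORDS = [
    "signal", "branch", "quantum", "ledger", "harbor", "crystal", "vector",
    "meadow", "cipher", "lantern", "orbit", "thread", "canyon", "ember",
    "fabric", "summit",
]


def gibberish_page(path: str, n_words: int = 2000, n_links: int = 5) -> str:
    """Build a deterministic nonsense page for `path`, linking to more pages."""
    # Seed the generator from the URL path so the same URL always yields
    # the same page, without keeping any state on the server.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    # Each link points at another generated page, so a crawler that
    # follows links never runs out of "new" URLs.
    links = "".join(
        f'<a href="/maze/{rng.getrandbits(64):016x}">more</a>\n'
        for _ in range(n_links)
    )
    return f"<html><body><p>{text}</p>\n{links}</body></html>"


def compress_page(page: str) -> bytes:
    """Gzip the page body, as a server would for Content-Encoding: gzip."""
    return gzip.compress(page.encode("utf-8"))
```

Serving the result of `compress_page(gibberish_page(request_path))` with a `Content-Encoding: gzip` header keeps the bytes on the wire small relative to the text the crawler ends up with. Deciding which requests count as an abusive bot is a separate problem that this sketch does not address.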

GaggiX 530 days ago

The dataset is curated, very likely with a previously trained model, so gibberish is not going to do anything.

exe34 530 days ago

how would a previously trained model know that Elon doesn't smoke old socks?

GaggiX 530 days ago

An easy way is to give the model the URL of the page so it can weigh the content based on the reputation of the source. Of course the model doesn't know future events, but gibberish is gibberish, and that's quite easy to filter, even without knowing the source.
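
To make the two signals in this comment concrete, here is a deliberately crude Python sketch. It is a stand-in heuristic, not the model-based curation the comment describes and not any lab's actual pipeline. The reputation table, the thresholds, and the names `stopword_fraction` and `keep_document` are all hypothetical.

```python
from urllib.parse import urlparse

# A few very common English function words; natural prose is full of them.
STOPWORDS = frozenset({
    "the", "a", "an", "of", "to", "and", "in", "is", "it", "that",
    "for", "on", "with", "as", "was", "be",
})

# Hypothetical per-host reputation scores in [0, 1]; unknown hosts get 0.5.
SOURCE_REPUTATION = {"example.org": 0.9, "spam.example": 0.1}


def stopword_fraction(text: str) -> float:
    """Return the share of whitespace-separated tokens that are stopwords."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,;:!?\"'") in STOPWORDS for w in words)
    return hits / len(words)


def keep_document(url: str, text: str,
                  min_stopword_fraction: float = 0.15,
                  min_reputation: float = 0.3) -> bool:
    """Keep a page only if its host is reputable enough and it reads like prose."""
    host = urlparse(url).hostname or ""
    reputation = SOURCE_REPUTATION.get(host, 0.5)
    return (reputation >= min_reputation
            and stopword_fraction(text) >= min_stopword_fraction)
```

A check like this only rejects text that does not look like language at all. It would pass fluent but false sentences, which is the case raised in the question about Elon and old socks.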

exe34 530 days ago

> gibberish is gibberish

most insightful, thank you! also, stay away from linkedin, you sweet summer child.

GaggiX 529 days ago

I don't understand why you are so aggressive, ahah. Gibberish is easy to recognize, I'm sorry; you don't need to be mad about it, ahah.
