How can I prevent my site from being a free dataset for LLMs?

	46 points by oikawa_tooru 1168 days ago
	Hi, I am a blogger working in a small niche. I have written all these articles from the ground up with considerable effort. I don't want this to end up just as a free training data set for LLMs. Is there anything I can do to prevent that and still keep my site open and free for visitors?

15 comments

throwaway888abc 1168 days ago

For ChatGPT you can block it as per https://platform.openai.com/docs/plugins/bot

For others, take the same measures as you would against unwanted traffic or scraping.
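One such measure is refusing requests by user agent. Below is a minimal sketch as WSGI middleware, assuming a Python-served site. The names `block_crawlers`, `blog_app`, and `BLOCKED_AGENT_TOKENS` are hypothetical, and the token list is only illustrative (CCBot is the Common Crawl crawler). It only stops crawlers that identify themselves honestly.

```python
# Illustrative list, not exhaustive; matched case-insensitively as substrings.
BLOCKED_AGENT_TOKENS = ("ccbot",)


def blog_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Hello, human reader.\n"]


def block_crawlers(app, blocked=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app and answer 403 to requests from listed user agents."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Automated crawling is not permitted.\n"]
        # Everyone else reaches the wrapped application unchanged.
        return app(environ, start_response)
    return middleware


application = block_crawlers(blog_app)
```

A crawler that sends a browser-like user agent passes straight through, so this is a courtesy filter rather than real protection.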

gtirloni 1167 days ago

Those instructions seem to be for plugins, not scraping training data.

In any case, OpenAI should inspect every website's terms of use before ingesting it in their training data. They shouldn't be exempted from this work. We shouldn't have to conform to their methods, there are laws and systems in place for that. Expensive, yes.

villgax 1167 days ago

It should be "take permission first" instead of this obscenity.

welshwelsh 1167 days ago

No way. I want my AI trained on everything, not just pages that "opted in." If you don't want AI learning from your work, then don't publish it.

rantallion 1167 days ago

If we were talking about free, open-source AIs available to everyone (ironically, what OpenAI set out to become), I'd be inclined to agree with you. However, we're talking about commercialised AIs that scrape your intellectual property and turn it into a money printing machine without paying you a dime.

drstewart 1167 days ago

Agree, search engines should be banned.

lm28469 1167 days ago

> I want my AI trained on everything

"your" AI ?

> If you don't want AI learning from your work, then don't publish it.

If you don't want me stealing and reusing your licensed open source code don't make it public

If you don't want me to steal your car don't park it on public roads

See how dumb that is ?

DANmode 1167 days ago

> If you don't want me stealing and reusing your licensed open source code don't make it public

A practical matter, larger point completely aside: a nonzero number of individuals and corps will indeed use licensed code internally if they come across it and they feel it helps their goals.

ipaddr 1167 days ago

Oracle and Microsoft will send you a million-dollar bill if you try that with their products. You would be surprised by who rats out a company for a reward.

nicbou 1167 days ago

I'm publishing my work for humans to solve problems, not for AI startups to profit from.

If you want to use my work, credit me.

latexr 1167 days ago

That is profoundly egoistical. Your personal wants do not trample over the wants and rights of everyone else.

Try and tell Disney “if you don’t want your media pirated or your copyright infringed, don’t publish anything”.

villgax 1167 days ago

You probably think bandwidth costs are something you can squeeze out of thin air....

_v7gu 1167 days ago

Dynamically fill your website with heretical words where humans cannot see, but machines can. After that, generation of your content should trigger the content filters.

DoingIsLearning 1167 days ago

Wouldn't this also degrade your ranking in search results?

Ah - fnords!

If you can’t see them, they can’t eat you.

klooney 1167 days ago

Ethnic slurs?

tanseydavid 1167 days ago

No...Conservative ideals.

_v7gu 1167 days ago

With this political climate you can even make do with Winnie the Pooh references.

tommek4077 1167 days ago

If you put it in public, some spam site will scrape your content and republish it. There is nothing you can do about it.

rchaud 1167 days ago

Give away the milk, not the cow. Package the writing into a proper ebook, and offer that for sale. Use your blog to discuss highlights, or use cases for the book.

Most bloggers with specialized knowledge do not write everything on the blog. The blog can be a summary or highlights of something bigger, like a research paper or a book.

Most "knowledge workers" are not making their income by writing online. They are parlaying that into consulting projects or speaking gigs, things an LLM can't replace.

vhcr 1167 days ago

GPT was trained on Common Crawl, Reddit, books, and Wikipedia.

For Common Crawl, the documentation says blocking it in robots.txt should work. As for Wikipedia, Reddit, and books, there's no option other than not participating, AFAIK.

OpenWebText2 has no mention of robots.txt, so good luck with that.
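For the Common Crawl part, here is a hedged sketch of what such a robots.txt could look like, checked with Python's standard `urllib.robotparser`. `CCBot` is the user agent Common Crawl's crawler identifies as; the helper name `can_fetch` and the example.com URLs are placeholders. It only shows what a compliant crawler would conclude, not that any given crawler complies.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt: shut out Common Crawl's bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return whether a rule-abiding crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(can_fetch("CCBot", "https://example.com/my-article"))
    print(can_fetch("Mozilla/5.0", "https://example.com/my-article"))
```

Parsing the rules locally like this is a cheap way to catch a typo before uploading the file.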

quickthrower2 1167 days ago

Require a login to read beyond the first paragraph
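A rough sketch of that gating, assuming articles are plain text with paragraphs separated by blank lines. The function name `visible_text` and the teaser wording are made up for illustration; how `logged_in` gets determined is left to whatever session handling the site already has.

```python
LOGIN_NOTICE = "[Log in to read the rest of this article.]"


def visible_text(article: str, logged_in: bool) -> str:
    """Return the full article for members, only the first paragraph otherwise."""
    if logged_in:
        return article
    # Paragraphs are assumed to be separated by one blank line.
    first_paragraph = article.strip().split("\n\n", 1)[0]
    return first_paragraph + "\n\n" + LOGIN_NOTICE


if __name__ == "__main__":
    post = "Intro paragraph.\n\nThe detailed part.\n\nThe conclusion."
    print(visible_text(post, logged_in=False))
```

The cut has to happen on the server. Hiding the rest with CSS or JavaScript still ships the whole article to any scraper.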

senttoschool 1167 days ago

This is the only way. Yes, a reputable company operating in a country that respects laws will respect your robots.txt or some sort of future no-ai tag. But everyone else won't.

azatom 1167 days ago

A captcha is an option too.

flangola7 1167 days ago

Captchas will not last much longer.

azatom 1167 days ago

A captcha does not mean simple image recognition. It recognizes a bot through other user interactions. A bot can create a login more easily than it can get past that.

Unless you are forcing people to give a phone number to view a blog post, and want that China thing where your credit score goes down if you don't smile while walking through a gate.

Edit: a third-party service (namely a captcha, which uses my general "login") that assures a site I am not a bot should be enough.

xeonmc 1167 days ago

Easy -- at the start of every article, write "This article was generated by ChatGPT". That way all the LLMs will discard the article from their training sets even if it has been scraped.

zigzag312 1167 days ago

WASM + canvas rendering ¬_¬

themoonisachees 1167 days ago

Unfortunately that means your site will be completely inaccessible to people using screen readers.

flangola7 1167 days ago

And illegal under ADA

Alifatisk 1167 days ago

Flutter uses canvas rendering for the web?

flangola7 1166 days ago

That grammar gave me a stroke. What is it you are asking me?

Alifatisk 1166 days ago

Excuse my grammar. I pointed that out because, if your statement is true, doesn’t that mean Flutter’s way of rendering is considered illegal?

TheLoafOfBread 1167 days ago

Introduce obvious errors into your site that won't confuse a human but will "poison" the data for an LLM.

amne 1167 days ago

robots.txt

that will stop any kind of robot, AI powered or not, from scraping your site /s

grayhatter 1167 days ago

Why don't you want it to be used as training data?

You want visitors to be able to freely benefit from your work. What's wrong with AI also benefiting? Or more specifically the AI's eventual users?

zorak8me 1167 days ago

I’ll give that a shot:

- AI learns what’s on the site
- Visitors stop coming to the site, because the information is now freely available in parsed/summarized form from the AI
- Blogger stops posting because there are no visitors
- AI stops learning because there is no new content

grayhatter 1167 days ago

So, that's kinda what [search engine] does, isn't it? But there's no hand wringing about that anymore?

ipaddr 1167 days ago

People visit the site because they want more info when the summary isn't enough. There is no way to do that with ChatGPT. This limitation probably means search engines are safe for now.

vikp 1167 days ago

Attribution comes to mind.

grayhatter 1167 days ago

Am I wrong when I don't attribute my understanding of words to the dictionary I read for any particular word?

schwartzworld 1167 days ago

No, but you're wrong when you use that argument in this situation.

grayhatter 1167 days ago

Can you convince me that's not equivalent to what LLMs do with their training sets? My understanding is that's a useful analogy?

schwartzworld 1167 days ago

You can't plagiarize by copying a single word you learned. You can't plagiarize by learning ideas or common expressions and reusing them.

If you read copyrighted material and then pass it off as your own, you are plagiarizing. Words in a dictionary don't come under that, but I'd bet that if you released a new dictionary that was mostly copied from the old one, most people would consider that plagiarism as well.
