Hacker News
How can I prevent my site from being a free dataset for LLMs?
46 points by oikawa_tooru 1168 days ago
Hi, I am a blogger working in a small niche. I have written all these articles from the ground up with considerable effort. I don't want them to end up as just a free training dataset for LLMs. Is there anything I can do to prevent that and still keep my site open and free for visitors?
15 comments

For ChatGPT you can block it as per https://platform.openai.com/docs/plugins/bot

For others, take the same measures you would against unwanted traffic or scraping.

Those instructions seem to be for plugins, not scraping training data.

In any case, OpenAI should inspect every website's terms of use before ingesting it into their training data. They shouldn't be exempted from this work. We shouldn't have to conform to their methods; there are laws and systems in place for that. Expensive, yes.

It should be "ask permission first" instead of this obscenity.
No way. I want my AI trained on everything, not just pages that "opted in." If you don't want AI learning from your work, then don't publish it.
If we were talking about free, open-source AIs available to everyone (ironically, what OpenAI set out to become), I'd be inclined to agree with you. However, we're talking about commercialised AIs that scrape your intellectual property and turn it into a money printing machine without paying you a dime.
Agree, search engines should be banned.
> I want my AI trained on everything

"your" AI ?

> If you don't want AI learning from your work, then don't publish it.

If you don't want me stealing and reusing your licensed open source code don't make it public

If you don't want me to steal your car don't park it on public roads

See how dumb that is ?

> If you don't want me stealing and reusing your licensed open source code don't make it public

A practical matter, larger point completely aside: a nonzero number of individuals and corps will indeed use licensed code internally if they come across it and they feel it helps their goals.

Oracle and Microsoft will send you a million-dollar bill if you try that with their products. You would be surprised by who rats out a company for a reward.
I'm publishing my work for humans to solve problems, not for AI startups to profit from.

If you want to use my work, credit me.

That is profoundly egoistical. Your personal wants do not trample over the wants and rights of everyone else.

Try and tell Disney “if you don’t want your media pirated or your copyright infringed, don’t publish anything”.

You probably think bandwidth costs are something you can squeeze out of thin air....
Dynamically fill your website with heretical words where humans cannot see, but machines can. After that, generation of your content should trigger the content filters.
Wouldn't this also degrade your ranking in search results?
Ah - fnords!
If you can’t see them, they can’t eat you.
Ethnic slurs?
No...Conservative ideals.
With this political climate you can even make do with Winnie the Pooh references.
If you put it in public, some spam site will scrape your content and republish it. There is nothing you can do about it.
Give away the milk, not the cow. Package the writing into a proper ebook, and offer that for sale. Use your blog to discuss highlights, or use cases for the book.

Most bloggers with specialized knowledge do not write everything on the blog. The blog can be a summary or highlights of something bigger, like a research paper or a book.

Most "knowledge workers" are not making their income by writing online. They are parlaying that into consulting projects or speaking gigs, things an LLM can't replace.

GPT was trained on Common Crawl, Reddit, books, and Wikipedia.

For Common Crawl, the documentation says blocking it in robots.txt should work. As for Wikipedia, Reddit, and books, there's no option other than not participating, AFAIK.

OpenWebText2 has no mention of robots.txt, so good luck with that.
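For the Common Crawl case, here is a rough sketch of how you could sanity-check a robots.txt rule before deploying it. It assumes Common Crawl's crawler identifies itself as `CCBot`; the `example.com` URL and the `is_allowed` helper are made up for illustration. It only tells you what a crawler that honors robots.txt would do, not what a misbehaving one will do.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: block Common Crawl's crawler (user agent
# assumed to be "CCBot") while leaving the site open to everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-respecting crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/posts/my-article"  # hypothetical URL
    for agent in ("CCBot", "Mozilla/5.0"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, url))
```

The same check can be repeated for any other crawler name you decide to list in the file.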

Require a login to read beyond the first paragraph
This is the only way. Yes, a reputable company operating in a country that respects laws will respect your robots.txt or some sort of future no-ai tag. But everyone else won't.
A captcha is another way.
Captchas will not last much longer.
A captcha does not just mean simple image recognition. It is for recognizing a bot through other user interactions. A bot can create a login more easily than it can get past that.

Unless you are forcing people to give a phone number to view a blog post, and want that China thing where if you do not smile while walking through a gate, your credits go down.

Edit: a third-party service (namely a captcha, which uses my general "login") that assures a site I am not a bot should be enough.

Easy -- at the start of every article, write "This article was generated by ChatGPT". This way all the LLMs will discard the article from their training sets even if it has been scraped.
WASM + canvas rendering ¬_¬
Unfortunately that means your site will be completely inaccessible to people using screen readers.
And illegal under ADA
Flutter uses canvas rendering for the web?
That grammar gave me a stroke. What is it you are asking me?
Excuse my grammar. I pointed that out because if your statement is true, then doesn’t that mean Flutter's way of rendering is considered illegal?
Introduce obvious errors into your site which won't confuse humans, but will "poison" the data for LLMs.
robots.txt

that will stop any kind of robot, AI powered or not, from scraping your site /s

Why don't you want it to be used as training data?

You want visitors to be able to freely benefit from your work. What's wrong with AI also benefiting? Or more specifically the AI's eventual users?

I’ll give that a shot:

- AI learns what’s on the site
- Visitors stop coming to the site, because the information is now freely available in parsed/summarized form from the AI
- Blogger stops posting because there are no visitors
- AI stops learning because there is no new content
So, that's kinda what [search engine] does, isn't it? But there's no hand wringing about that anymore?
People visit the site because they want more info if the summary isn't enough. There's no way to do that with ChatGPT. This limitation probably means search engines are safe for now.
Attribution comes to mind.
Am I wrong when I don't attribute my understanding of words to the dictionary I read for any particular word?
No, but you're wrong when you use that argument in this situation.
Can you convince me that's not equivalent to what LLMs do with their training sets? My understanding is that's a useful analogy?
You can't plagiarize by copying a single word you learned. You can't plagiarize by learning ideas or common expressions and reusing them.

If you read copyrighted material and then pass it off as your own, you are plagiarizing. Words in a dictionary don't come under that, but I'd bet that if you released a new dictionary that was mostly copied from the old one, most people would consider that plagiarism as well.

Robots.txt? Although I highly doubt that would do any good.
I think there is a noai meta tag
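For reference, the tag usually meant here is a robots meta tag with `noai` (and `noimageai`) in its `content` attribute. As far as I know it is a non-standard convention, and nothing obliges a scraper to honor it. A small sketch for checking that your own pages actually emit it; the sample page and the `has_noai` helper are hypothetical:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noai, noimageai".
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noai(html: str) -> bool:
    """Return True if the page declares the (non-standard) noai directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noai" in parser.directives


# Hypothetical page used only for illustration.
SAMPLE_PAGE = (
    '<html><head><meta name="robots" content="noai, noimageai"></head>'
    "<body><p>My article</p></body></html>"
)

if __name__ == "__main__":
    print("noai present:", has_noai(SAMPLE_PAGE))
```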
That's like expecting a "no dog on the lawn" sign to work on stray dogs.
Humans might be the dog waiting next to the "cat fud" sign and hoping it will work...

https://duckduckgo.com/?t=ffcm&q=cat+fud&iax=images&ia=image...

I'm new here. Do you know how I can submit a post in Ask?
add a load of pages full of junk that only a crawler will find

might as well poison the well

Or ip ban anyone requesting those hidden pages
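A minimal sketch of the ban-on-hidden-page idea, kept separate from any particular web server. The class name, trap path, and IP addresses are all made up; in practice you would call something like `should_serve` from your server's request handler. One assumption worth making explicit: the trap paths should also be disallowed in robots.txt, so that well-behaved search crawlers never request them and don't get banned.

```python
class HoneypotBanList:
    """Bans any client IP that requests a hidden trap page."""

    def __init__(self, trap_paths):
        # Paths linked only where humans won't see or click them.
        self.trap_paths = set(trap_paths)
        self.banned = set()

    def should_serve(self, ip: str, path: str) -> bool:
        """Return False if the request should be refused."""
        if ip in self.banned:
            return False
        if path in self.trap_paths:
            # Only a crawler following hidden links ends up here.
            self.banned.add(ip)
            return False
        return True


if __name__ == "__main__":
    bans = HoneypotBanList({"/trap/archive-2019.html"})  # hypothetical path
    requests = [
        ("203.0.113.7", "/posts/hello"),
        ("203.0.113.7", "/trap/archive-2019.html"),
        ("203.0.113.7", "/posts/hello"),
        ("198.51.100.2", "/posts/hello"),
    ]
    for ip, path in requests:
        print(ip, path, "serve" if bans.should_serve(ip, path) else "refuse")
```

Banning by IP is blunt: shared or rotating addresses mean you can both block innocent visitors and miss the scraper on its next request.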
License your content and never underestimate the power of a good lawyer.
Isn't my vanilla wordpress.com blog copyrighted by default? Will CC licensing protect it better in the eyes of the law?