HN Mirror



	by grayhatter 1167 days ago
	Why don't you want it to be used as training data? You want visitors to be able to freely benefit from your work. What's wrong with AI also benefiting? Or more specifically the AI's eventual users?

2 comments

zorak8me 1167 days ago

I’ll give that a shot:
- AI learns what’s on the site.
- Visitors stop coming to the site, because the information is now freely available in parsed/summarized form from the AI.
- Blogger stops posting because there are no visitors.
- AI stops learning because there is no new content.


grayhatter 1167 days ago

So, that's kinda what [search engine] does, isn't it? But there's no hand-wringing about that anymore?


ipaddr 1167 days ago

People visit the site because they want more info if the summary isn't enough. There's no way to do that with ChatGPT. This limitation probably means search engines are safe for now.


vikp 1167 days ago

Attribution comes to mind.


grayhatter 1167 days ago

Am I wrong when I don't attribute my understanding of words to the dictionary I read for any particular word?


schwartzworld 1167 days ago

No, but you're wrong when you use that argument in this situation.


grayhatter 1167 days ago

Can you convince me that's not equivalent to what LLMs do with their training sets? My understanding is that's a useful analogy?


schwartzworld 1167 days ago

You can't plagiarize by copying a single word you learned. You can't plagiarize by learning ideas or common expressions and reusing them.

If you read copyrighted material and then pass it off as your own, you are plagiarizing. Words in a dictionary don't come under that, but I'd bet that if you released a new dictionary that was mostly copied from the old one, most people would consider that plagiarism as well.


grayhatter 1167 days ago

I completely agree with this, but my understanding of how LLMs work is that they don't copy meaningful segments of text from any specific source. Instead, they predict the next block of text, which they'd only do if they've seen that idea/sequence enough times, with context, to rank the prediction high enough.

I haven't seen any service copy out large enough blocks of text to make me think it's reasonable to call their output plagiarized.

Meaning, if the LLM I use will only repeat an idea that many people have written about, such that it has seen the idea, or parts of that idea, many times, why is that still plagiarism? Or rather, worthy of direct attribution? Or why was I wrong to use the argument about citing a dictionary here?

(I'm aware that a number of people are working on giving AI memory so it can quote from pages like Wikipedia. But I don't think it's fair to call that "training data".)
