| HN Mirror



	by Jensson 335 days ago
	> LLMs are hardly reliable ways to reproduce copyrighted works

	Only because the companies are intentionally making it so. If they weren't trained to not reproduce copyrighted works they would be able to.

3 comments

ben_w 335 days ago

They're probably training them to refuse, but fundamentally the models are obviously too small to memorise most content, and can only do it when there are many copies in the training set. Quotation is a waste of parameters better used for generalisation.

The other thing is that approximately all of the training set is copyrighted, because that's the default even for e.g. comments on forums like this comment you're reading now.

The other other thing is that at least two of the big model makers went and pirated book archives on top of crawling the web.


jazzyjackson 335 days ago

it's like these people never tried asking for song lyrics


terminalshort 335 days ago

LLMs even fail on tasks like "repeat back to me exactly the following text: ..." To say they can exactly and reliably reproduce copyrighted work is quite a claim.


tomschwiha 335 days ago

You can also ask people to repeat a text and some will fail. What I want to say is that the fact that some LLMs (probably only older ones) fail doesn't mean that most future ones will. Especially if benchmarks indicate they are becoming smarter over time.
