| HN Mirror



	by Workaccount2 539 days ago
	The answer is to censor the model output, not the training input. A dumb filter using 20-year-old technology can easily stop LLMs from outputting copyrighted text verbatim.
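
A minimal sketch of what such a "dumb filter" might look like, assuming it means exact word n-gram (shingle) matching against an index of protected text. The names `shingles`, `build_index`, and `is_verbatim`, the window size `n`, and the `threshold` are all illustrative choices, not anything specified in the comment:

```python
def shingles(text, n=8):
    """Return the set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(protected_texts, n=8):
    """Collect every n-gram from the protected corpus into one set for O(1) lookups."""
    index = set()
    for text in protected_texts:
        index |= shingles(text, n)
    return index


def is_verbatim(output, index, n=8, threshold=1):
    """Flag the output if at least `threshold` of its n-grams appear in the index."""
    hits = sum(1 for s in shingles(output, n) if s in index)
    return hits >= threshold


# Hypothetical usage with a one-line stand-in for a protected corpus.
protected = ["it was the best of times it was the worst of times it was the age of wisdom"]
index = build_index(protected, n=5)
print(is_verbatim("he said it was the best of times it was the worst indeed", index, n=5))
print(is_verbatim("a completely original sentence about cats and dogs playing", index, n=5))
```

The index has to use the same `n` as the check. A real corpus would need a more compact structure than a Python set of strings (hashed shingles or a Bloom filter, for instance), and this only catches exact word-for-word runs.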

2 comments

bbor 539 days ago

I know that this seems likely from a theoretical perspective (in other words, I would way underestimate it at the sprint planning meeting!), but

A) checking each output against a regex representing a hundred years of literature would be expensive AF no matter how streamlined you make it, and

B) latent space allows for small deviations that would still get you in trouble but are very hard to catch without a truly latent wrapper (i.e. another LLM call). A good visual example of this is the coverage early on in the Disney v. ChatGPT lawsuit:

[1] IEEE: https://spectrum.ieee.org/midjourney-copyright

[2] reliable ol' Gary Marcus: https://garymarcus.substack.com/p/things-are-about-to-get-a-...


esafak 539 days ago

What if the model simply substitutes synonyms here and there without changing the spirit of the material? (This might not work for poetry, obviously.) It is not such a simple matter.
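
To make the synonym point concrete, here is a rough sketch, assuming a filter based on exact word n-gram matching. The helper names and the toy sentences are made up for illustration. Swapping a few words drives exact overlap to zero even though most of the vocabulary is unchanged:

```python
def ngrams(text, n):
    """Set of lowercase word n-grams, as tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def exact_overlap(candidate, source, n=5):
    """Fraction of the candidate's n-grams that appear verbatim in the source."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(source, n)) / len(cand)


def word_jaccard(a, b):
    """Jaccard similarity of the two texts' word sets (ignores order)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


# Toy example: three synonyms swapped into a 13-word sentence.
source = "the quick brown fox jumps over the lazy dog near the river bank"
paraphrase = "the quick brown fox leaps over the lazy hound near the river shore"

print(exact_overlap(paraphrase, source, n=5))  # no 5-gram survives the swaps
print(word_jaccard(paraphrase, source))        # yet most words are shared
```

Word-set Jaccard is only a crude stand-in for "same spirit"; it says nothing about where a legal line would fall.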


Workaccount2 539 days ago

It's pretty simple: you are absolutely allowed to do that, and it's been done forever.

Imagine having the copyright claim to "Person's family member is killed so they go and get revenge".


esafak 539 days ago

So I can duplicate a book, change a word or two, and sell it? That does not sound right.
