by Workaccount2 492 days ago
The answer is to censor the model output, not the training input. A dumb filter using 20-year-old technology can easily stop LLMs from outputting copyrighted text verbatim.
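For concreteness, here is a minimal sketch of what such a "dumb filter" could look like: index every run of n consecutive words from the protected texts, then flag any output that reproduces one of those runs. The function names, the tokenizer, and the window size n=8 are illustrative assumptions, not anything a real vendor is known to use.

```python
import re


def words(text):
    # Lowercase and keep only word-like tokens, so punctuation and
    # capitalization changes do not hide a match.
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text, n=8):
    # Every run of n consecutive words, joined back into a string.
    w = words(text)
    return {" ".join(w[i:i + n]) for i in range(len(w) - n + 1)}


def build_index(protected_texts, n=8):
    # One big set of n-word runs taken from the protected corpus.
    index = set()
    for text in protected_texts:
        index |= shingles(text, n)
    return index


def is_verbatim(output, index, n=8):
    # True if the output shares at least one n-word run with the corpus.
    return any(s in index for s in shingles(output, n))
```

Each check is a set lookup per word window in the output. The index itself grows with the size of the protected corpus, and this only catches exact word-for-word runs.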

I know that this seems likely from a theoretical perspective (in other words, I would way underestimate it at the sprint planning meeting!), but

A) checking each output against a regex representing a hundred years of literature would be expensive AF no matter how streamlined you make it, and

B) latent space allows for small deviations that would still get you in trouble but are very hard to catch without a truly latent wrapper (i.e. another LLM call). A good visual example of this is the coverage early on in the Disney v. Midjourney lawsuit:

[1] IEEE: https://spectrum.ieee.org/midjourney-copyright

[2] reliable ol' Gary Marcus: https://garymarcus.substack.com/p/things-are-about-to-get-a-...

What if the model simply substitutes synonyms here and there without changing the spirit of the material? (This might not work for poetry, obviously.) It is not such a simple matter.
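To make the synonym problem concrete, here is a rough sketch comparing an exact long-run check with a looser overlap score on short word runs. The function names and window sizes are assumptions for illustration, and where to set a cutoff on the score is a judgment call that this sketch does not settle.

```python
import re


def word_ngrams(text, n):
    # Set of all runs of n consecutive lowercase words in the text.
    w = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}


def exact_hit(candidate, source, n=8):
    # Verbatim-style check: do the two texts share any n-word run?
    return bool(word_ngrams(candidate, n) & word_ngrams(source, n))


def containment(candidate, source, n=3):
    # Fraction of the candidate's short word runs that also occur in the source.
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


source = ("it was the best of times it was the worst of times "
          "it was the age of wisdom")
paraphrase = ("it was the finest of times it was the worst of eras "
              "it was the epoch of wisdom")

# Three swapped words are enough to break every 8-word run.
print(exact_hit(paraphrase, source))
# The short-run score still shows partial overlap with the source.
print(containment(paraphrase, source))
```

A swap every few words defeats the exact check entirely, while the overlap score only degrades. A heavier paraphrase would push that score toward zero too, which is where something like a second model call comes in.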
It's pretty simple: you are absolutely allowed to do that, and it's been done forever.

Imagine having the copyright claim to "Person's family member is killed so they go and get revenge".

So I can duplicate a book, change a word or two, and sell it? That does not sound right.