The answer is to censor the model output, not the training input. A dumb filter using 20-year-old technology can easily stop LLMs from outputting copyrighted text verbatim.
I know that this seems feasible from a theoretical perspective (in other words, I would way underestimate it at the sprint planning meeting!), but
A) checking each output against a regex representing a hundred years of literature would be expensive AF no matter how streamlined you make it, and
B) latent space allows for small deviations that would still get you in trouble but are very hard to catch without a wrapper that itself works in latent space (i.e., another LLM call). A good visual example of this is the coverage early on in the Disney v. ChatGPT lawsuit [1][2].
What if the model simply substitutes synonyms here and there without changing the spirit of the material? (This might not work for poetry, obviously.) It is not such a simple matter.
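To make the synonym problem concrete, here is a rough sketch of what a "dumb filter" might look like. Everything in it is an assumption for illustration: it uses hashed word 5-grams rather than a regex, the helper names (`shingles`, `build_index`, `overlap_ratio`, `is_blocked`) and the 0.5 threshold are made up, and a public-domain Dickens line stands in for the protected corpus.

```python
import re

def shingles(text: str, n: int = 5) -> set:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(corpus: list, n: int = 5) -> set:
    """Union the n-grams of every protected document into one lookup set."""
    index = set()
    for doc in corpus:
        index |= shingles(doc, n)
    return index

def overlap_ratio(output: str, index: set, n: int = 5) -> float:
    """Fraction of the output's distinct n-grams that appear in the index."""
    grams = shingles(output, n)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)

def is_blocked(output: str, index: set, threshold: float = 0.5) -> bool:
    """Hypothetical policy: block when at least half the n-grams match."""
    return overlap_ratio(output, index) >= threshold

# Stand-in for the protected corpus (public-domain text, illustration only).
PROTECTED = [
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness"
]
INDEX = build_index(PROTECTED)

verbatim = "it was the best of times, it was the worst of times"
swapped = ("It was the finest of times, it was the poorest of times, "
           "it was the era of wisdom")

print(is_blocked(verbatim, INDEX))  # exact copy: every 5-gram is in the index
print(is_blocked(swapped, INDEX))   # synonym swaps break almost every 5-gram
```

The exact copy is caught, but three swapped words are enough to drop the overlap well under the threshold even though any reader would recognize the source. Lowering the threshold or shrinking `n` starts flagging ordinary phrases instead, which is roughly why point B ends up needing something smarter than string matching.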
[1] IEEE: https://spectrum.ieee.org/midjourney-copyright
[2] reliable ol' Gary Marcus: https://garymarcus.substack.com/p/things-are-about-to-get-a-...