Hacker News
by contravariant 1034 days ago
The part OpenAI will have to argue is that it's not merely compression but an irreversible transformation.

Which is hard. Their best hope is to put the burden of proof on the NYTimes to show that you can make the model regurgitate their articles (with some nudging).

If they manage that, then the NYTimes is going to have a lot of trouble showing the model actually breaches their copyright, because the information contained in their articles alone is not enough to constitute a copyrightable work.

1 comments

Any form of lossy compression is an irreversible transformation. We do it all the time for video, audio and images (you can't recover the original data), and the results are still copyrighted.
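To make the irreversibility point concrete, here's a toy sketch. It assumes plain uniform quantization as a stand-in for what real codecs do (they are far more elaborate), and the function names and sample values are made up for illustration. Two different inputs collapse to the same compressed form, so the original can't be recovered from it.

```python
def quantize(samples, step):
    """Toy lossy 'compression': snap each sample to the nearest multiple of step."""
    return [round(s / step) for s in samples]


def dequantize(codes, step):
    """'Decompression': map each code back to its multiple of step."""
    return [c * step for c in codes]


# Two different (hypothetical) originals.
original_a = [0.12, 0.49, 0.91]
original_b = [0.08, 0.51, 0.93]
step = 0.25

codes_a = quantize(original_a, step)
codes_b = quantize(original_b, step)

# Both originals map to the same codes, so the transformation is not invertible.
print(codes_a == codes_b)
# The reconstruction is close to the input, but it is not the input.
print(dequantize(codes_a, step), "vs", original_a)
```

The reconstruction is still recognizably "the same signal", which is the sense in which a lossy copy stays tied to its source even though the exact data is gone.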
When you compress a video, it doesn't create a new movie with a different story, different lines of text, different scenes, and different compositions for scenes that are similar to the "original".
But what is being compressed is the entire corpus of text. It's compressed into model weights. It's the weights that might fall under the copyright of the authors of the texts used to train the model.

The weights are also executable code (in some sense). When you query an LLM, you're running this program with a given input. Yes, when it runs it outputs a whole lot of things (sometimes novel combinations, sometimes verbatim repetition of training data), but the point here isn't whether the output of the LLM is copyrighted; it's whether the weights are.