Hacker News
by gorkempacaci 1034 days ago
I tend to agree with you, but one could argue that a “statistical collection of words” is a form of compression. For example, you can’t write a kids’ version of a novel and sell it without dealing with copyright.

The part OpenAI will have to argue is that it's not merely compression but an irreversible transformation.

Which is hard. The best hope they have is to put the burden of proof on the NYTimes to show that you can make the model regurgitate their articles (with some nudging).

If they manage that, then the NYTimes is going to have a lot of trouble showing the model actually breaches their copyright, because the information contained in their articles is not, by itself, enough to constitute a copyrightable work.

Any form of lossy compression is an irreversible transformation. We do it all the time for video, audio and images (you can't recover the original data), and the compressed results are still copyrighted.
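To make the "irreversible" part concrete, here is a minimal sketch of a lossy step, assuming a toy quantizer rather than any real codec. The names `quantize` and `dequantize` and the sample values are made up for illustration.

```python
def quantize(samples, step):
    """Lossy step: keep only the index of the nearest multiple of `step`."""
    return [round(s / step) for s in samples]


def dequantize(codes, step):
    """Best-effort reconstruction: the discarded detail cannot be restored."""
    return [c * step for c in codes]


# Hypothetical "signal"; 0.49 and 0.51 end up with the same code.
original = [0.12, 0.49, 0.51, 0.88]
codes = quantize(original, 0.5)
restored = dequantize(codes, 0.5)

print(codes)
print(restored)
```

Two different inputs collapse to the same code, so nothing can tell them apart afterwards, yet the restored signal is still recognizably an approximation of the original.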
When you compress a video, it doesn't create a new movie with a different story, different lines of text, different scenes, and different compositions for scenes that are similar to the "original".
But what is being compressed is the entire corpus of text. It's compressed into model weights. It's the weights that might be under copyright of the authors of the texts that trained it.

The weights are also executable code (in some sense). When you query an LLM you're running this program with a given input. Yeah, when it runs it outputs a whole lot of things (sometimes novel combinations, sometimes verbatim repetition of training data), but the point here isn't whether the output of the LLM is copyrighted; it's the weights.

The model is just a model: it's one part of a compression algorithm. The compressed data would be the prompt plus the choice of which predicted tokens to accept (e.g. when not always choosing the most likely next token). The end user supplies the prompt, and the choice function is randomized rather than being used to store data, so it is the end user who provides the compressed data.
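A rough sketch of that framing, assuming a toy bigram counter as a stand-in for an LLM (every name here, including `train_bigram`, `encode` and `decode`, is hypothetical). The "compressed data" is the prompt plus, for each position, the rank of the token that was accepted among the model's predictions. Anyone holding the same model can replay those choices to get the text back.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Toy stand-in for a language model: count which token follows which."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def ranked_predictions(model, prev, vocab):
    """All vocab tokens, most likely first; ties broken alphabetically so
    that encoder and decoder always agree on the ordering."""
    counts = model.get(prev, {})
    return sorted(vocab, key=lambda t: (-counts.get(t, 0), t))


def encode(model, vocab, prompt, text_tokens):
    """Compressed form: for each token, its rank among the predictions."""
    prev, ranks = prompt, []
    for tok in text_tokens:
        ranks.append(ranked_predictions(model, prev, vocab).index(tok))
        prev = tok
    return ranks


def decode(model, vocab, prompt, ranks):
    """Replay the stored choices against the same model."""
    prev, out = prompt, []
    for r in ranks:
        tok = ranked_predictions(model, prev, vocab)[r]
        out.append(tok)
        prev = tok
    return out


corpus = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(corpus))
model = train_bigram(corpus)

message = "cat sat on the mat".split()
ranks = encode(model, vocab, "the", message)
print(ranks)
print(decode(model, vocab, "the", ranks))
```

In this sketch a rank of 0 means "accept the most likely next token", so text the model predicts well encodes as mostly zeros. Whether that makes the weights or the prompt-plus-choices the "copy" is exactly what is being argued above; the code only illustrates the mechanism.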