by Tazerenix 1025 days ago
Realistically, an AI model is basically just a very complicated piece of software. The model weights are akin to the software code, the model outputs are akin to the outputs a user of the software creates, and the datasets are akin to the intellectual property put into the software by the developer to create the code.

In the same way that a developer could not simply steal someone else's intellectual property in order to develop a feature of a piece of software, one cannot simply steal intellectual property to adjust the model weights. The main difference is that it's generally quite easy to see in practice if a model has utilized some intellectual property (because, for example, you can ask ChatGPT to recite the first 100 words of Harry Potter), compared to another piece of software where you'd need access to the source code or the developers' thoughts (which could only be achieved through litigation, in most circumstances).

I think a great many people come up with convoluted answers to this question because they are uncomfortable with the reality that these very large organizations have essentially stolen hoards of intellectual property, and now that the horse has bolted people want to justify not closing the barn door. It seems to me very simple: to train an AI model on data, you must respect its copyright. The model weights should be copyrightable by the developers of the model (even if the law currently does not allow this), and the outputs of the model should be copyrightable by the person who interacted with the model (software) to produce the outputs.

The analogy with Photoshop is extremely simple: if some other software invented Gaussian blurring and copyrighted it, then Adobe would have to license that technology from them to include it as a feature in Photoshop. The actual Photoshop software/code would be copyrighted by Adobe, and if someone created a blurry image with Photoshop they could copyright it.

I think people only disagree with this due to some sense that the process of translating data to model weights is "automatic" or "computational" in nature. You could in principle get a person to, by hand, go through millions of data sets and compute the changes to the model weights. This is no different from someone writing a piece of code, checking someone else's approach, and adjusting their own code after the fact. It just happens that we have developed very effective tooling to automate the adjusting of the code.

2 comments

Three points I want to make:

- Models are nothing more than a statistical distillation of facts that can be traversed. They are not like software at all. Calling them software is like calling pachinko machines software. Nonsense.

- Models are mechanically derived with no element of human authorship or creativity. You could argue that there is creativity in selecting the dataset or the process that derives the model, but neither is relevant to the final generated model. Even if we assumed for the sake of argument that a model is more than just a statistical distillation, it should still not be considered copyrightable for this reason alone.

- Don't use the word "steal" when you refer to the well-defined act of infringement. Stealing implies deprivation of property, which does not and cannot occur in this case. Using the word "infringe" is more honest and less manipulative.

It's not stealing, and the term "intellectual property" should be put to rest:

https://www.gnu.org/philosophy/not-ipr.en.html

Your opinions on what should be copyrightable are wrong, and fortunately the courts agree.