by clbrmbr 823 days ago
> IMHO models trained on not-properly-licensed (pirated) data should at the very least not be copyrightable and should be public domain.

My understanding is that ML model weights cannot be copyrighted as an original creative work. They are trade secrets, protected through contracts, but once they leak to third parties it’s not a copyright violation to use or distribute them.

Whether the model is actually a derivative work of the training data is another interesting question.

Or is my theory off here?

1 comments

The main argument I have seen (which is also OpenAI's in its legal briefs) is that training on the data is fair use. The idea of "fair use" is that you are conceding that you are infringing by creating a derivative work, but it's still okay. Implied in the fair use argument is that it is a derivative work.

> Implied in the fair use argument is that it is a derivative work.

You can get generative models to spit out almost exact copies of known IP visuals from movies and games. For instance, with DALL-E and Midjourney, it's relatively easy to get pictures that closely resemble images owned by film and game studios. Those are copies with minor changes. It would be hard to argue otherwise in court. The same happens with ChatGPT spitting out verbatim passages from New York Times articles.