
I was not suggesting that an LLM itself consists of an arrangement of the copyrighted works comprising the training data, but that the specific selection of the copyrighted works comprising the training data is part of what differentiates one LLM from another. A strained but useful analogy might be to think of the styles of painting an artist is trained in and/or exposed to prior to creating their own art. Whether the influence is obvious or subtle, the art styles an artist has studied would likely impact the style they develop for themselves.

However, to address your point about derivative works directly, the consensus among copyright law experts appears to be that whether a particular model output is infringing depends on the standard copyright infringement analysis (and that’s regardless of the minor and correctable issue represented by memorization/overfitting of duplicate data in training sets). Only in the most unserious legal complaint (the class action filed against Midjourney, Stability AI, etc.) is the argument being made that the models actually contain copies of the training data.