| > That's simply not correct within the applicable meaning of "derives" as understood in copyright law. Would be rather hard to write a definition that handles it properly back when LLMs didn't exist; not that laws particularly have anything to do with intent/desires behind FOSS anyway - intent is clearly there: you get code, under the condition that if you use it for anything, I get credited; else, you get nothing. > In fact, data per se is not even within the scope of copyright protection in the first place: specific published works are copyrighted, but the underlying ideas and facts that they convey are not. Luckily, FOSS is specific published works, and unless LLMs actually reasonably-provably do such decomposing into ideas/facts (good luck reasoning about that), that part is also irrelevant. > If you applied the principle you're proposing here to human developers, you'd conclude that any code written by someone who learned to program by studying techniques used in FOSS software would in turn be a derivative work of that software. No one has ever regarded this to be the case. Depending on intent, that very much can happen, it's called plagiarism. Good luck proving an LLM's intent. (Not to mention the obvious differentiating factor of LLMs having arbitrarily-good memory, unlike humans.) |
But this has never been a condition in the FOSS world, as far as I'm aware. I've only ever seen attribution requirements attach to redistribution of source, not usage of the software.
I understand that the crux of the debate here is whether training an LLM is redistribution of the underlying code, but to me, it seems to be fairly clear that it is not.
> Luckily, FOSS is specific published works, and unless LLMs actually reasonably-provably do such decomposing into ideas/facts (good luck reasoning about that), that part is also irrelevant.
That's literally all LLMs do. That's what tokenization is. And it's trivially provable: compare a trained model with the copyrighted works you're claiming it replicates, and all you'll see on the model side is probability matrices representing correlations between decomposed units of knowledge, aggregated across the entire dataset as an integrated whole.
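To make that a bit more concrete, here's a toy sketch. It is only an analogy and nothing like a real LLM: whitespace splitting stands in for a subword tokenizer, a bigram count table stands in for a neural network, and the names `tokenize` and `train_bigram` and the two-sentence corpus are made up for illustration. What it shows is the shape of what gets stored: next-token probabilities pooled across all the documents.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace splitting stands in for a real subword tokenizer.
    return text.lower().split()


def train_bigram(corpus):
    # Count how often each token follows each other token, across ALL documents.
    counts = defaultdict(Counter)
    for doc in corpus:
        tokens = tokenize(doc)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Normalize the counts into conditional probabilities P(next | prev).
    return {
        prev: {tok: n / sum(nxts.values()) for tok, n in nxts.items()}
        for prev, nxts in counts.items()
    }


# Hypothetical two-document "dataset".
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
model = train_bigram(corpus)
print(model["the"])  # next-token distribution after "the", pooled over both docs
print(model["sat"])  # next-token distribution after "sat"
```

The resulting `model` is a table of probabilities keyed by token, with the statistics from both sentences mixed together. How far that picture carries over to billions of learned weights is, of course, the thing being argued about.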
> Depending on intent, that very much can happen, it's called plagiarism. Good luck proving an LLM's intent.
The only intent ever in play is that of the user. LLMs are just software.