by Gormo 10 days ago
> It's a tool, if using data is necessary to make the tool work, then its output derives from the data.

That's simply not correct within the applicable meaning of "derives" as understood in copyright law. In fact, data per se is not even within the scope of copyright protection in the first place: specific published works are copyrighted, but the underlying ideas and facts that they convey are not.

Even a work that merely draws on a single source of data, but expresses the ideas drawn from it in a new or transformative way, is not considered a derivative work (see the ruling in Google v. Oracle, for example), let alone works based on patterns extrapolated by relating together ideas sourced from many distinct works, which is what LLMs are principally doing.

If you applied the principle you're proposing here to human developers, you'd conclude that any code written by someone who learned to program by studying techniques used in FOSS software would in turn be a derivative work of that software. No one has ever regarded this to be the case.


> That's simply not correct within the applicable meaning of "derives" as understood in copyright law.

It would have been rather hard to write a definition that handles this properly back when LLMs didn't exist. Not that the law has much to do with the intent behind FOSS anyway, and the intent is clearly there: you get code, under the condition that if you use it for anything, I get credited; else, you get nothing.

> In fact, data per se is not even within the scope of copyright protection in the first place: specific published works are copyrighted, but the underlying ideas and facts that they convey are not.

Luckily, FOSS is specific published works, and unless LLMs actually reasonably-provably do such decomposing into ideas/facts (good luck reasoning about that), that part is also irrelevant.

> If you applied the principle you're proposing here to human developers, you'd conclude that any code written by someone who learned to program by studying techniques used in FOSS software would in turn be a derivative work of that software. No one has ever regarded this to be the case.

Depending on intent, that very much can happen, it's called plagiarism. Good luck proving an LLM's intent. (Not to mention the obvious differentiating factor of LLMs having arbitrarily good memory, unlike humans.)

> under the condition that if you use it for anything, I get credited; else, you get nothing.

But this has never been a condition in the FOSS world, as far as I'm aware. I've only ever seen attribution requirements attach to redistribution of source, not usage of the software.

I understand that the crux of the debate here is whether training an LLM is redistribution of the underlying code, but to me, it seems to be fairly clear that it is not.

> Luckily, FOSS is specific published works, and unless LLMs actually reasonably-provably do such decomposing into ideas/facts (good luck reasoning about that), that part is also irrelevant.

That's literally all LLMs do. That's what tokenization is. And it's trivially provable, since if you compare the models with the copyrighted works you're claiming they replicate, all you'll see on the LLM side is probability matrices representing correlations between decomposed units of knowledge aggregated across the entire dataset as an integrated whole.

> Depending on intent, that very much can happen, it's called plagiarism. Good luck proving an LLM's intent.

The only intent ever in play is that of the user. LLMs are just software.

> But this has never been a condition in the FOSS world, as far as I'm aware. I've only ever seen attribution requirements attach to redistribution of source, not usage of the software.

The AGPL requires that even users who only interact with the software across a network be given a way to get the source, and with it the license (i.e. attribution). Never mind that LLMs consume the source code rather than "using" the software anyway. (And of course things go further downhill for LLMs with licenses more restrictive than the AGPL.)

Otherwise, I'd say that, for many, the ideal condition for (copyleft) FOSS would be that anything that utilizes source code in any form also provides said source code and license/attribution. Sometimes that can even extend to outputs of software (and e.g. gcc takes time to explicitly state that its compiled code output does not count as being derived from gcc's code).

> whether training an LLM is redistribution of the underlying code

There's a funky side-note of whether LLM training can even be done on material with improperly-followed licensing; if you don't even have the permission to modify the material (as properly following MIT/GPL/etc would give you), it might be illegal to even tokenize it, never mind use it for training.

> That's literally all LLMs do. That's what tokenization is.

It's clearly not that simple, otherwise "split source into 10-char chunks, reverse that list, reverse it back, join this fun list we've gotten" would be enough to circumvent copyright.
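To make that concrete, here is a minimal sketch of the chunk-and-rejoin "transformation" in Python. The function names and the sample string are made up for illustration; the point is only that the output is byte-for-byte the input, whatever happens in between.

```python
def chunk(text: str, size: int = 10) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def launder(text: str, size: int = 10) -> str:
    """Split into chunks, reverse the list, reverse it back, join.

    The two reversals cancel out, so this is an identity function.
    """
    chunks = chunk(text, size)
    chunks.reverse()
    chunks.reverse()
    return "".join(chunks)


# Hypothetical stand-in for a licensed source file.
source = "int main(void) { return 0; }  /* some licensed code */"

# The "transformed" output is exactly the original text.
assert launder(source) == source
```

Splitting something into pieces is not, by itself, evidence that anything was decomposed into ideas or facts.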

> all you'll see on the LLM side is probability matrices representing correlations between decomposed units of knowledge aggregated across the entire dataset as an integrated whole.

Yeah, you need at least that; tokenization is irrelevant. But the jury's out on this one: of course a good chunk is some form of "abstract knowledge", but other parts could just be encoding material in some compressed form (and surely gzipping a source code file doesn't circumvent copyright), and that, at the very least, can apply to the weights.
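For the gzip half of that analogy, a small Python sketch (the helper name and sample text are hypothetical): the compressed bytes look nothing like the source, yet the source comes back exactly. This says nothing about what model weights actually contain; it only shows why "the stored form doesn't look like the original" is not a sufficient argument on its own.

```python
import gzip


def compress_roundtrip(source: str) -> tuple[bytes, str]:
    """Compress text with gzip, then decompress it again.

    Returns the compressed bytes and the restored text.
    """
    packed = gzip.compress(source.encode("utf-8"))
    restored = gzip.decompress(packed).decode("utf-8")
    return packed, restored


# Hypothetical stand-in for a licensed source file.
source = "/* some licensed code */\n" * 20

packed, restored = compress_roundtrip(source)

# The compressed form is a different byte sequence from the original...
assert packed != source.encode("utf-8")
# ...but the original is fully recoverable from it.
assert restored == source
```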

> The only intent ever in play is that of the user. LLMs are just software.

So my split-into-words-and-join-back would be a valid circumvention of copyright, as long as the user of the software doing it isn't informed that it's effectively just copying the material directly? (I'll grant that, in such a case, the accidental infringer might get a smaller penalty and/or get to deflect the punishment onto whoever mismarketed the software to them, but that wouldn't apply to anyone who knows that LLMs are very much trained directly on copyrighted material. I don't know about legally derived, but surely mathematically derived.)

Never mind that, for some things, learning some specific copyrighted code is the desired thing (humans do do this after all!), at which point at the very least the weights of the model are as copyright-infused as a gzipped source code file is.

If intent determination is on the user, and the user is aware that LLMs are very much technically capable of reproducing copyrighted works to some extent (which they had better be), it would be on the user to ensure that any specific code they end up using is not infringing, which is a rather non-trivial task. A human who writes code can reasonably reason about whether they're infringing on whatever they learned from, but splitting the job into LLM writing plus human checking makes that basically infeasible.