

by cgearhart 1087 days ago

Part of what LLMs do is compress their training dataset into the weights, often with character-perfect recall later. For example, I would be shocked if any sufficiently large LLM failed when prompted “write the quake fast inverse square root algorithm verbatim”.

(I’m not really interested in arguing whether that’s all they do, or whether it’s the purpose of LLMs—those details are just a distraction from the original question: what makes LLM training different than a human reading code.)

If the model has memorized the training set and can reproduce it verbatim when prompted, then it should be incumbent on the AI owner to prove that it does not reproduce copyrighted code when it is not explicitly prompted.


IshKebab 1086 days ago

So it's just about accuracy of recall then, not use of training data?

I think the most likely outcome will be to treat AI just like people. They're allowed to learn from any code they can see, but that doesn't mean a copy they reproduce from memory is somehow free of its original copyright.

That's very consistent with how copyright law already works.

This will leave AI users in a slightly awkward position where they are responsible for figuring out if they unknowingly used AI to copy code, but it's not like that can't happen already - as soon as you hire a programmer you might be unknowingly allowing copied code into your product.


cgearhart 1086 days ago

No, I don’t think it’s just a question of recall accuracy. The issue really hinges on whether or not the AI itself is a derivative work of the training data, as I think that would trigger certain requirements in the original source licenses. Lots of folks seem to think that it is not a derivative work because (a) the model is just a bunch of numeric weights, it doesn’t contain any explicit code; and (b) it’s possible for the model to output original code in some cases. But that’s flawed reasoning because it’s quite clear that the model weights do contain perfect copies of at least some training code, and the models can produce that code perfectly (without the original license) when prompted. Thus it seems clear that the model itself should be treated as a derivative work, whereas a human is not, even if they memorize the code they read.


IshKebab 1086 days ago

Why is a human not though? I don't think it's as simple as you imagine. A human who has memorized the information contains it just as much as the weights.


cgearhart 1086 days ago

Both human and LLM may learn from reading code to produce novel, derivative, or duplicative work—but that’s not the issue, because the model itself is a derivative of the training data and the human is not. That does seem very simple to me.

If we just zipped up the entire training data set and distributed it with the model then it would clearly be a copy and/or derivative work. The LLM does the same thing as zipping (i.e., compresses the training data…by encoding it in the model weights). Folks just seem to think that it’s not a derivative work because an LLM _also_ does more than that sometimes (e.g., extrapolates from the training data to produce novel token sequences as output).
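
To make the zipping half of that analogy concrete, here is a minimal sketch using Python's standard `zlib` module. The `corpus` string is a made-up stand-in for training data, not anything from a real dataset, and nothing here models how an LLM actually stores information. It only shows what "compressed but still a copy" means for an ordinary lossless compressor.

```python
import zlib

# Hypothetical stand-in for a training corpus (an assumption for illustration).
corpus = ("def add(a, b):\n    return a + b\n" * 200).encode("utf-8")

# Lossless compression: the compressed bytes look nothing like the source...
compressed = zlib.compress(corpus, 9)

# ...but decompressing recovers every byte of the original.
restored = zlib.decompress(compressed)

print(len(corpus), len(compressed))
print(restored == corpus)
```

The compressed blob is much smaller than the input and unreadable as text, yet the round trip is exact, which is why a zipped dataset would be treated as a copy.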


IshKebab 1086 days ago

> the human is not

Why not? Humans store information in their brains that they have learnt. So do AIs. What exactly is the difference between a weight in an Artificial Neural Network and a weight in a Natural Neural Network?

If the answer is "humans get special treatment" then that's fine I guess but I think it's worth being explicit that that's the difference.

> The LLM does the same thing as zipping (i.e., compresses the training data…by encoding it in the model weights).

It's not at all the same. It's highly lossy. Only extremely highly repeated works get memorised exactly and even then it's often not exact.

LLMs do not contain a copy of all the training data (if trained properly). I agree if that was the case then it would be different, but that isn't how they work (unless you badly overfit).


JoshTriplett 1086 days ago

> If the answer is "humans get special treatment" then that's fine I guess but I think it's worth being explicit that that's the difference.

That's absolutely the difference. Humans aren't copyrightable; the alternative would be unconscionable.

> even then it's often not exact

You don't have to copy something exactly to be a derivative work. "Lord of the Rings but a random 15% of words are replaced with gibberish" is still a derivative work of Lord of the Rings. So is "Lord of the Rings but every word/sentence is paraphrased".
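
A rough sketch of the "15% gibberish" case, in Python. The sample sentence, the `garble` and `unchanged_fraction` helpers, and the fixed seed are all assumptions for illustration, and word overlap is obviously not a legal test for derivative works. It just shows how much of the original survives that kind of corruption.

```python
import random
import string

def garble(text: str, fraction: float = 0.15, seed: int = 0) -> str:
    """Replace roughly `fraction` of the words with random lowercase letters."""
    rng = random.Random(seed)
    words = text.split()
    n_replace = round(len(words) * fraction)
    for i in rng.sample(range(len(words)), n_replace):
        words[i] = "".join(rng.choice(string.ascii_lowercase) for _ in words[i])
    return " ".join(words)

def unchanged_fraction(original: str, copy: str) -> float:
    """Fraction of word positions that are identical in both texts."""
    a, b = original.split(), copy.split()
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical sample text standing in for the original work.
original = ("the quick brown fox jumps over the lazy dog while the sleepy "
            "cat watches from the warm window sill nearby")
copy = garble(original)

print(copy)
print(unchanged_fraction(original, copy))
```

Even after the replacement, the large majority of word positions are untouched, so the result is plainly recognizable as the same text.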


cgearhart 1086 days ago

An LLM contains some portion of the training data exactly and the rest of it lossily. What I’m really arguing is that this _alone_ is enough to make the model itself a derivative work. It actually doesn’t matter whether that’s the same as or different from a human; that’s a distraction. The AI model is itself a work that is derived from the training data.
