by cgearhart
1087 days ago
Part of what LLMs do is compress their training dataset into the weights, often with character-perfect recall later. For example, I would be shocked if any sufficiently large LLM failed when prompted “write the Quake fast inverse square root algorithm verbatim”. (I’m not really interested in arguing whether that’s all they do, or whether it’s the purpose of LLMs; those details are just a distraction from the original question: what makes LLM training different from a human reading code.) If the model has memorized the training set and can reproduce it verbatim when prompted, then it should be incumbent on the AI owner to prove that it does not reproduce copyrighted code when it is not explicitly prompted to.

I think the most likely outcome will be to treat AI just like people. They're allowed to learn from any code they can see, but that doesn't mean a copy they reproduce from memory is somehow free of its original copyright.
That's very consistent with how copyright law already works.
This will leave AI users in a slightly awkward position where they are responsible for figuring out whether they used AI to unknowingly copy code, but it's not like that can't happen already: as soon as you hire a programmer, you might be unknowingly allowing copied code into your product.