|
Basically, the difference is that merely reading code doesn't create a derivative work that we can meaningfully examine. Yes, it gets stored in your brain, but your brain re-encodes all that knowledge in a way only it can use. We're still quite a way from brain uploading, so that's not a meaningful avenue to discuss right now. An LLM, on the other hand, works off a model that was trained first, and that model can be saved to a file and read back later. As a result, it's a derivative work that we can examine, copy, share, modify and do all the other things we generally associate with something being a Work. Whether the binary output of a program can be copyrighted is somewhat unclear, but from what I've heard (not legal advice, I Am Not A Lawyer), it seems to be the case unless you explicitly say it's not[0].
There are a few other things to consider, like how you, as a human, can make the conscious decision to avoid replicating GPL code that you've seen if you're not allowed to use it (whether by restructuring the code, applying the same techniques in a different language, or, at the most heavyweight end, clean-rooming it). AIs don't have the ability to make that distinction (and to my understanding, due to how they work, the only way to meaningfully avoid it is to ensure that the entire model is compliant, so the AI can't go off on its own tangent and decide to include incompatible code).
From a more practical perspective - Copilot will happily spit out Quake III's fast inverse square root function and apply the wrong license to it. It's GPL-licensed code, but IIRC Copilot claimed it was BSD-licensed? That alone would constitute a violation, and it'd be weird not to point at the people who trained the model that allowed it to make that choice.
To be fair, right now a lot of this is up in the air, and all we have to go on is kinda wishy-washy guidance from copyright offices (mostly just refusing registration on the basis that copyrighted material has to be made by a human, not by a machine). There are a couple of ongoing lawsuits specifically about Copilot, and from what I last heard, the judges aren't very impressed by the defense from GitHub/MSFT/OpenAI. The approach also differs greatly per country/governing body - Japan's government has, for example, given blanket permission for non-commercial AI training while keeping a strict eye on anyone trying to use it for paid services, while the EU is passing legislation that seems to mostly lean towards "it's copyrighted, that's now your problem to get in line with it", without outright saying so yet.
[0]: This is the main reason why, for FOSS, Creative Commons licenses are usually not seen as a good pick outside of assets, because they can interfere with distributing compiled versions of your code. |
This is incorrect; it doesn't matter whether it's commercial or non-commercial, and you can use anything as training data regardless of copyright. See the 2018 amendment to the copyright law.