I don't see the use itself as a problem, but rather that the result is not treated as a derivative work of the input. If I train it on GPL code, the result should be GPL, too.
This is kind of like saying that any programmer who has ever learned something from reading GPL code can only use that knowledge when writing GPL code. It's not literally copying the code. The training set isn't stored on disk and regurgitated.
Also, there is logic in Copilot that checks whether a suggestion is an exact duplicate of code from its training set, and if it is, the suggestion is never sent to the user.
But Copilot is not a programmer, Copilot is a program. Slapping the "ML" label on a program doesn't magically absolve its programmers of all responsibility, as much as tech companies over the past decade have tried to convince people otherwise.
I really dislike this false equivalence between human learning and machine learning. The two are significantly distinct in almost every way, both in their process and in their output. The scale is also vastly different. No human could possibly ingest all of the open source code on GitHub, much less regurgitate millions of snippets from what they “studied.”
> This is kind of like saying that any programmer who has ever learned something from reading GPL code can only use that knowledge when writing GPL code. It's not literally copying the code. The training set isn't stored on disk and regurgitated.
I wouldn't put any hard rules on it, but it does seem very fair for programmers who have learned a lot from GPL code to contribute back to GPL projects. I have learned from and used a lot of open source software, so whenever possible I try to make my projects available to learn from or use.
Yes. It is completely valid, understandable, and reasonable to have a variety of different feelings and views about how specific code and specific licenses are used.
This is particularly the case when we see the emergence of new technologies that use that code in different ways. Different people may have a wide variety of equally valid views about how it is incorporated into such a system.
There's nothing inconsistent, confusing, or complex about those views.
I think the issue is not that it’s trained on open source code but that it’s trained on code whose licenses may not permit it. If you license your project in a permissive way then I don’t see a problem.
(IANAL) It's a tool, transforming source code. The result thus seems like a derivative work; whether you are or are not allowed to use that in your work depends on the originating license. (And perhaps, your license. E.g., you can't derive from a GPL project and license it as MIT, as the GPL doesn't permit that. But to license as GPL would be fine. But this minimal example assumes all the input to Copilot was GPL, which I rather doubt is true, and I don't think we even know what the input was.)
I think there might be some in this thread who don't consider these derivatives, for whatever reason, but it seems to me that if rangeCheck() passes de minimis, then the output from Copilot almost certainly does, too. That a tool is doing the copying and mutating, as opposed to a human, seems immaterial to it all. (Now, I don't know that I agree with rangeCheck() not being de minimis … and yet.) Or they think that Copilot is "thinking", which, ha, no.
Open source licenses aren't a free-for-all. Many have terms like the GPL's copyleft/share-alike or the attribution requirements of many other licenses. If Copilot was trained on such code, then it seems that it, and/or the code it generates, violates those licenses.