No, a model trained on text covered by a license is not itself covered by the license, unless it explicitly copies the text (you cannot copyright a "style").
But it actually is explicitly copying the text. That's how it works. The training data are massive, and you will get long strings of code that are pulled directly from that training data. It isn't giving you just the style. It may be mashing together several different code examples taking some text from each. That's called "derivative work".
"[...] the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set"
If that's the case (only 0.1%), the developers must have done something that differs from other openai experiments that suggest code sequences that I recall seeing, where significant chunks of code from Stack Overflow or similar sites were appearing in answers.
How are you going to prove it was the AI that generated the GPL licensed function ad verbatim from another project, rather than you just opening that project and copying the function yourself?
Synthesising material from various sources isn't copyright infringement, that's called writing.
It's only infringement if the portion copied is significant either absolutely or relatively. A line here or there of the millions in the Linux kernel is okay. A couple of lines of a haiku is not. Copyright is not leprosy.
We all don't have Google resources. What if someone comes after us individually because some model-generated code is near identical to code in a GPL codebase? Where is the liability here?
> What is my responsibility when I accept GitHub Copilot suggestions?
> You are responsible for the content you create with the assistance of GitHub Copilot. We recommend that you carefully test, review, and vet the code, as you would with any code you write yourself.
We are all vulnerable to predatory lawyer trolls, whether we do things correctly or not. If you are accused of reusing a GPL code, then you ask clarification on which and you rewrite. It is likely to be just a snippet. I doubt Copilot would write a whole lib by copying it from another project.
And yes, of course github is not going to take responsibility for things you do with their tools.
If you learn programming from Stack Overflow and Github, and then repeat something that you learned over your time at reading, that's not just copying text. That's having learned the text. You could say the human brain is mashing together several different code examples, taking some text from each.
Wouldn't that imply that a person who learned to code on GPLv2 sources wrote writes some more code in that style (including "long strings of code", some of which are clearly not unique to GPL) is writing code that is "born GPLv2"?
My guess is that it is, if we think of a machine learning framework as a compiler and the model as compiled code. Compiled GPL code is still GPL, that's the entire point.
Anyways, GitHub is Microsoft, and Microsoft has really good lawyers so I guess they did everything necessary to make sur that you can use it the way they tell you so. The most obvious solution would be to filter by LICENSE.txt and only train the model with code under permissive licenses.
The trained model is a derivative work that contains copies of the corpus used for training embedded in the model. If any of the training code was GPL the output is now covered by GPL. The music industry has already done most of the heavy lifting here in terms of scope and nature of derived works, and while IANAL I would not suggest that it looks good for anyone using this tool if GPL code was in the training set.
I can't say what's happening in GitHub Copilot, but it's not necessarily true that the only way to produce syntactically valid outputs is to take substrings of the source text. It is possible to learn something approximating a generative grammar.
Strictly speaking, you could train a model which does not contain the original source text (just the underlying language structure and work tokens), and generates ASCII strings which are consistent with the underlying generative model, that are also always valid code. I expect to see code generator models that explicitly generate valid code as part of their generalization capability.