Hacker News new | ask | show | jobs
by dekhn 1821 days ago
No, a model trained on text covered by a license is not itself covered by the license, unless it explicitly copies the text (you cannot copyright a "style").
6 comments

But it actually is explicitly copying the text. That's how it works. The training data are massive, and you will get long strings of code that are pulled directly from that training data. It isn't giving you just the style. It may be mashing together several different code examples taking some text from each. That's called "derivative work".
No, that's not how it works.

"[...] the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set"

https://copilot.github.com/#faqs

If that's the case (only 0.1%), the developers must have done something that differs from other openai experiments that suggest code sequences that I recall seeing, where significant chunks of code from Stack Overflow or similar sites were appearing in answers.
So you're gambling on whether that the code that was generated or copied.
No you aren't. Courts will consider it fair use.
How are you going to prove it was the AI that generated the GPL licensed function ad verbatim from another project, rather than you just opening that project and copying the function yourself?
I will not. Courts will simply consider a single function not to be substantive enough piece of work to constitute unfair use.
use a bloom filter to skip/regenerate that 0.1%
Synthesising material from various sources isn't copyright infringement, that's called writing.

It's only infringement if the portion copied is significant either absolutely or relatively. A line here or there of the millions in the Linux kernel is okay. A couple of lines of a haiku is not. Copyright is not leprosy.

Google Books actually displays full pages of copyrighted works Google did not license. It was considered legal.

[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

We all don't have Google resources. What if someone comes after us individually because some model-generated code is near identical to code in a GPL codebase? Where is the liability here?

edit: from https://copilot.github.com/

> What is my responsibility when I accept GitHub Copilot suggestions?

> You are responsible for the content you create with the assistance of GitHub Copilot. We recommend that you carefully test, review, and vet the code, as you would with any code you write yourself.

Well, that solves that question.

We are all vulnerable to predatory lawyer trolls, whether we do things correctly or not. If you are accused of reusing a GPL code, then you ask clarification on which and you rewrite. It is likely to be just a snippet. I doubt Copilot would write a whole lib by copying it from another project.

And yes, of course github is not going to take responsibility for things you do with their tools.

If you learn programming from Stack Overflow and Github, and then repeat something that you learned over your time at reading, that's not just copying text. That's having learned the text. You could say the human brain is mashing together several different code examples, taking some text from each.
hmm, so let's think this through.

Wouldn't that imply that a person who learned to code on GPLv2 sources wrote writes some more code in that style (including "long strings of code", some of which are clearly not unique to GPL) is writing code that is "born GPLv2"?

I don't think it currently works that way.

My guess is that it is, if we think of a machine learning framework as a compiler and the model as compiled code. Compiled GPL code is still GPL, that's the entire point.

Anyways, GitHub is Microsoft, and Microsoft has really good lawyers so I guess they did everything necessary to make sur that you can use it the way they tell you so. The most obvious solution would be to filter by LICENSE.txt and only train the model with code under permissive licenses.

> you cannot copyright a "style"

This line of thinking applies to the code generated by the model, but not necessarily to the model itself, or the training of it.

Thanks- in retrospect, I shoudl have explicitly said "code generated by the model".
The trained model is a derivative work that contains copies of the corpus used for training embedded in the model. If any of the training code was GPL the output is now covered by GPL. The music industry has already done most of the heavy lifting here in terms of scope and nature of derived works, and while IANAL I would not suggest that it looks good for anyone using this tool if GPL code was in the training set.
Well, it probably is explicitly copying at least some subset of the source text - otherwise the code would be syntactically invalid, no?
I can't say what's happening in GitHub Copilot, but it's not necessarily true that the only way to produce syntactically valid outputs is to take substrings of the source text. It is possible to learn something approximating a generative grammar.

Take a look at https://karpathy.github.io/2015/05/21/rnn-effectiveness/

At the same time, I would not be surprised if there are outputs that do correspond to the source training data.

Strictly speaking, you could train a model which does not contain the original source text (just the underlying language structure and work tokens), and generates ASCII strings which are consistent with the underlying generative model, that are also always valid code. I expect to see code generator models that explicitly generate valid code as part of their generalization capability.
There will almost certainly be cases where it copies exact lines. When working with GPT2 I got whole chunks of news articles.