Hacker News
by shakna 1807 days ago
> No, see Authors Guild v. Google.

That case required that the output be transformative, in that "words in books are being used in a way they have not been used before".

Copilot only fits the transformative aspect if it is not directly reciting code that already exists in the form in which it is being redistributed. So long as it does recite such code, it fails to meet the criteria.

3 comments

I think you might be considering two different acts here:

1. The act of training Copilot on public code

2. The resulting use of Copilot to generate presumably new code

#1 is arguably close to the Authors Guild v. Google case. You are literally transforming the input code into an entirely new thing: a series of statistical parameters determining what functioning code "looks like". You can use this information to generate a whole bunch of novel and useful code sequences, not just by feeding it parts of its training data and acting shocked that it remembered what it saw. That smells like fair use to me.

#2 is where things get more dicey - just because it's legal to train an ML system on copyrighted data wouldn't mean that its resulting output is non-infringing. The network itself is fair use, but the code it generates would be used in an ordinary commercial context, so you wouldn't be able to make a fair use argument here. This is the difference between scanning a bunch of books into a search engine, versus copying a paragraph out of the search engine and into your own work.

(More generally: Fair use is non-transitive. Each reuse triggers a new fair use analysis of every prior work in the chain, because each fair reuse creates a new copyright around what you added, but the original copyright also still remains.)

Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

Absent this, I don't think there's a case. The courts have given extraordinarily wide latitude to fair use and ML algorithms are routinely trained on copyrighted works, photos, etc. without a license.

I understand that this feels more personal because it involves our field, but artists and authors have expressed the same sentiment when neural nets began making pictures and sentences.

The question here is no different than "Is GPT-3 an unlicensed, unlawfully created derivative work of millions, if not billions of people?"

No, I'm quite confident it is not.

> Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

It doesn't need to be substantial. In Google v. Oracle a 9-line function was found to be infringing.

If I recall correctly, the nine-line question wasn't decided by the Supreme Court, but the API question was.

The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf

> The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

Yes, because it was _transformative_, in a clear way. Because an API is only an interface. Which makes that part of that decision largely irrelevant to the topic at hand.

> Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language. Google’s purpose was to create a different task-related system for a different computing environment (smartphones) and to create a platform—the Android platform—that would help achieve and popularize that objective.

> If I recall correctly, the nine line question wasn't decided by the supreme court, but the API question was.

It was already decided earlier, and Google did not contest it, choosing instead to negotiate a zero payment settlement with Oracle over the rangeCheck function. There was no need for the Supreme Court to hear it.

A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.

> A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

That's not the case. It wasn't an out-of-court settlement, but an agreement about the damages being sought. The court had already found it to be infringing, and that was part of the ruling.

But none of that changes that 9 lines is substantial enough to be infringing. It isn't necessary for the copied portion to be a large body of work.

> If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.

No... It means the rangeCheck function was infringing. The implication you seem to have inferred here wouldn't be inferred by any kind of plagiarism case.

The point of Copilot -- its entire value as a product -- is to produce code that matches the intent and semantics of code that was in the input. In other words, very deliberately not transformative in purpose.
Why did you choose the standard of "substantial" = "100s of lines"? Especially since we've already seen examples of verbatim output in the dozens of lines range, that choice of standard is rather conveniently just outside what exists so far. If we find a case with 200 lines of verbatim output will you say the only reasonable standard is 1000s of lines?

I don't think your argument is as strong as you're making it out to be.

Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use. I would be surprised if many of us haven't inadvertently "copied" some GPL code in this way!

This goes to the "substantial" test for fair use. Clips from a film can contain core plot points, quotes from a book can contain vital passages to understanding a character, screen captures and scrapes of a website can contain huge amounts of textual detail, but depending on the four factors for fair use, still be fair use. (There have been exceptions though.)

The reaction on Hacker News to a machine producing code trained on their works is no different than the reactions artists and writers have had to other ML models. I suspect many of us are biased because it strikes at what we do and we think that our copyrights (because we have so many neat licenses) are special. They are not.

I think it would need to get to that level of "Copilot will emit a kernel module" before it's not obviously fair use.

After all, Google Books will happily convey to me whole pages from copyrighted works, page after page after page.

https://www.google.com/books/edition/Capital_in_the_Twenty_F...

> Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use.

It's anything but obvious. https://www.copyright.gov/fair-use/

> there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.

9 lines of very run-of-the-mill code in Oracle / Google weren't considered fair use.

A big difference is that software both is and isn't an artistic work.
It's not possible to get copilot to output a transformed version of the input?
Transformed output _may_ fall under fair use.

However - Copilot directly recites code. That is _very unlikely_ to fall under fair use.

Redistributing the exact same code, in the same form, for the same purpose, probably means that Copilot, and thus the people responsible for it, are infringing.

> However - Copilot directly recites code.

You make that statement as an absolute, but in the interests of clarity, all evidence so far shows that it directly recites code very rarely indeed. Even the Quake example had to be prompted by the specific variable names used in the original code.

In practice, the output code is heavily influenced by your own context — the comments you include, the variable names you use, even the name of the file you are editing — and with use it’s obvious that the code is almost certainly not a direct recitation of any existing code.

> all evidence so far shows that it directly recites code very rarely indeed.

_Once_ is enough for it to be infringing. The law is not very forgiving when you try and handwave it away.

You sound quite sure that the outlying instances of direct copying wouldn't be covered by the Fair Use copyright exemption. Any particular reason for that?

I tend to think it would be covered (provided they were relatively small snippets and not entire functions).

I'm not the person you're replying to, but one strong reason is that the global reach and standardization of copyright law is far broader than the global reach and standardization of the fair use exception. A single non-US country in which GitHub Copilot is used in a way that would be infringing without the US fair use exception, and outside the scope of any such exception in that law, would be enough to cause GitHub/MS a legal hassle. There could well be more than one such country.
Precedent. Google v. Oracle found 9 lines of an "obvious" implementation to be infringing.
So if a foreign company pilfers the source code to Windows, can they add it to a training set and then 'prompt' the machine learning algorithm to spit out a new 'copyright free' Windows, just by transforming the variable names?
I think that's my question regarding this whole thing:

If it's so fair use, why not train it on all Microsoft code, regardless of license (in addition to GitHub.com)? Would Microsoft employees be fine with Copilot re-creating "from memory" portions of Windows to use in WINE?

Well no, because only GitHub has access to the training set. But more importantly this misunderstands how Copilot even works -- even if Windows was in the training set, you couldn't get Copilot to reproduce it. It only generates a few lines of code at a time, and even then it's almost certainly entirely novel code.

Now, if you knew the code you wanted Copilot to generate you could certainly type it character by character and you might save yourself a few keystrokes with the TAB key, but it's going to be much MUCH easier to simply copy the whole codebase as files, and now you're right back where you started.

GPT-3 is still Microsoft licensed, but a similar model can be put together with the freely available GPT-2 and source code -- especially if your intent is copyright transfer.

As Francois Chollet points out in this talk, ultimately deep neural network models are locally sensitive hash tables, so the examples of people pulling out source code is an inherent shortcoming of deep learning models in general. Give the right 'key' and you can 'recall' the value you are looking for.

https://www.youtube.com/watch?v=J0p_thJJnoo
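
To make the "right key recalls the value" analogy concrete, here is a toy sketch. It is only an illustration of the hash-table picture, not how Copilot or any real neural network works: `ToyMemory`, the Jaccard similarity over whitespace tokens, the 0.5 threshold, and the stored snippets are all made up for this example.

```python
def tokens(text):
    """Split text into a set of lowercase whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Similarity of two token sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ToyMemory:
    """A similarity-keyed lookup table (hypothetical, for illustration only)."""

    def __init__(self):
        self.entries = []  # list of (key token set, stored value)

    def store(self, key, value):
        self.entries.append((tokens(key), value))

    def recall(self, query, threshold=0.5):
        """Return the value whose key is most similar to the query, or None
        if no stored key is at least `threshold` similar."""
        query_tokens = tokens(query)
        best_score, best_value = 0.0, None
        for key_tokens, value in self.entries:
            score = jaccard(query_tokens, key_tokens)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= threshold else None


memory = ToyMemory()
memory.store("float fast_inv_sqrt ( float number )", "/* memorized body A */")
memory.store("def add ( a , b )", "/* memorized body B */")

# A prompt that closely matches a stored key pulls the stored value back out.
hit = memory.recall("float FAST_INV_SQRT ( float number )")
# A prompt that resembles nothing stored recalls nothing.
miss = memory.recall("def multiply matrices")
```

In this picture, prompting with the original variable names is what supplies a key close enough to trigger verbatim recall; a loosely related prompt does not.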

> "However - Copilot directly recites code."

Sounds like that wouldn't be difficult to fix? Transform the code to an intermediate representation (https://en.wikipedia.org/wiki/Intermediate_representation) as a pre-processing stage, which ditches any non-essential structure of the code and eliminates comments, variable names, etc., before running the learning algorithms on it. Et voila, much like a human learning something and reimplementing it, only essential code is generated without any possibility of accidentally regurgitating verbatim snippets of the source data.
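
A minimal sketch of what such a pre-processing stage could look like, assuming Python source and using the standard `ast` module (Python 3.9+ for `ast.unparse`) as a stand-in for a real intermediate representation. The `normalize` function and the `v0`, `v1`, ... placeholder scheme are inventions for this example, and nothing here is known to be part of Copilot's pipeline.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Rename every identifier to a placeholder (v0, v1, ...) in order of
    first appearance. Builtins are renamed too; this sketch does not try
    to tell them apart from user-defined names."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        if name not in self.names:
            self.names[name] = f"v{len(self.names)}"
        return self.names[name]

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name)
        self.generic_visit(node)  # continue into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node


def normalize(source):
    """Parse source, drop comments (the parser discards them), rename
    identifiers, and return the regenerated code."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.unparse(tree)


original = "def add(x, y):\n    # sum two numbers\n    return x + y\n"
renamed = "def plus(left, right):\n    return left + right  # total\n"

normalized_original = normalize(original)
normalized_renamed = normalize(renamed)
```

Both inputs normalize to the same text, so comments and naming choices are gone. Note that the control flow and expression structure survive unchanged, so this only removes surface detail; whether that is enough to rule out regurgitating the underlying code is a separate question.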

At that point, can we all just agree IP is the stupidest concept to ever be layered on top of math (which programming is) and move on with non-copyrightable code?
Only if you agree that copyleft licenses are also stupid; without copyright, there's no way to prevent companies from making closed-source forks of code you wrote and intended to stay open.
The whole point of copyleft was as a stepping stone to get to RMS's four freedoms (https://www.gnu.org/philosophy/free-sw.en.html) which effectively eliminates copyright for software.
Correct. Copyleft is idiocy as well. You don't really need to pay for a proprietary fork of a tool when no one can keep you out of the free one, and the proprietary stuff diffuses into the free option.
Yes, sure. Without copyright there's no need for copyleft left, right?