by xxpor 1792 days ago
Software licenses have barely been tested in court, let alone how they apply to code injected and combined with other code via machine learning. You're extremely overconfident about how this will actually play out.

For one, just because your code is covered by the GPL, it doesn't mean every single line in isolation is copyrightable. It has to demonstrate creativity. That's why you don't have to worry about writing for (int i = 0; i < idx; i++) {.


You're right that code has to demonstrate creativity for copyright. But that also means that an algorithm, even a transformative algorithm, cannot change copyright because an algorithm is not creative, by definition.

This means that the output of any algorithm on copyrighted code is still under the original copyright. I mean, we still apply the copyright of the original to the output of compilers, even though compilers can be transformative with inlining and link-time optimization, to the point that it mixes disparate code in the same way Copilot does.

In fact, I wrote some software licenses [1] that codify the fact that algorithms cannot change copyright.

[1]: https://yzena.com/licenses/

You sound very confident about this, whereas the copyright lawyers I've read discussing this issue seem much less confident overall, and lean toward thinking this would be fair use.

What makes you so confident that this would not be ruled fair use?

(And for people not familiar - if ruled fair use, it doesn't matter what the license is because fair use is an exception to copyright itself.)

I have a feeling you did not read the FAQ of the licenses. I don't blame you, but they explain my position.

Here's the relevant quote:

> GitHub is arguing that using FOSS code in Copilot is fair use because using data for training a machine learning algorithm has been labelled as fair use. [1]

> However, even though the training is supposedly fair use, that doesn’t mean that the distribution of the output of such algorithms is fair use.

My licenses say, basically, "Sure, training is fair use, but distributing the output is not."

The licenses specifically say that the copyright applies to any output of any algorithm that uses the source code as all or part of its input.

Now, I have not gotten a lawyer to look at my licenses yet (it's in the works), so don't use them yourself. But because everyone keeps saying that training is fair use, I'm fairly confident that only training is fair use.

Of course, it might not be, but that would take more court cases and more precedent. I wanted to poison the well now [2] to make companies nervous about using a model that was partially trained with code licensed under my licenses.

[1]: https://valohai.com/blog/copyright-laws-and-machine-learning...

[2]: https://gavinhoward.com/2021/07/poisoning-github-copilot-and...

> My licenses say, basically, "Sure, training is fair use, but distributing the output is not."

Licenses basically by definition cannot say what is and isn't fair use...

> Licenses basically by definition cannot say what is and isn't fair use...

Yes. However, my licenses only say what people already say. Then the licenses go further and say, "But anything else is not allowed."

Everyone else says training is fair use. My licenses agree. But they make it clear that I don't believe that anything else is fair use.

Yes, these licenses must be tested in court. Except that they poison the well now.

It's mildly interesting that you've decided to express your personal opinion about what is or is not fair use within your license text, but the fact is that if a use of the work is deemed to be fair use under the law then the terms of the license you're offering are completely irrelevant. Your permission is not required to make fair use of the work, so no one needs to agree to your license.

Licenses can't dictate what is not allowed unless the user wants to use it in a way compliant with the rest of the license. If you decide to not follow the license at all, then it's effectively like any other copyright where you can use it without the owner's permission under fair use.

That doesn't usually mean you can use code though, see: https://news.ycombinator.com/item?id=27726343

> it doesn't mean every single line in isolation is copyrightable

Microsoft did not just copy individual lines. They fed whole repositories into their model, ignoring the license (if it exists) even though they knew from the start that information generated by the model will be publicly available. Available usually out of context, but nonetheless - the scope of the input and intent are very clearly "everything" and "redistribution".

Just adding a filter/ML model to the output shouldn't matter. I dare you to build a Copilot clone trained from leaked internal Microsoft code and then try to argue the output is a bit mixed up.

That is a clear violation imho.

Copilot was trained on leaked internal Microsoft code that's on GitHub at the moment. Anyway, everyone seems perfectly OK with training language models on copyrighted text.

Everyone is not perfectly OK with training language models on copyrighted text. It's just that evilCorps do it anyways, and there's nothing anyone can do to stop them. I can't do anything. At best, I could get a Twitter account and complain to the ether. The copyright holders can't do anything against the might of evilCorps, but that doesn't make them okay with it. The fact you believe this is just sad, and exactly what evilCorps want from you.

This goes beyond fair use or satirical/comedic effect. They are training their models to output text in the style of the authors being absorbed. The style is exactly the artistic effect that is being copyrighted.

Could you explain why you think training models on copyrighted text is illegal or copyright infringement or whatever else it might be?

Training the models is fine. Applying the models, which reproduces copyrighted works without proper attribution, is where it gets sticky.

My explanation will not be popular here on HN, but I'm never one to shy away. Especially when asked directly.

Buying a book, an audio CD, or a DVD/Blu-ray grants the holder permission to read, listen to, or view that product as a single instance. You can lend them out, but that's all you're really allowed to do with them. The text, audio, or video is not yours to do with as you please. People obviously do not like that, and argue making copies/backups is their right. Maybe that's acceptable, but we can agree that posting them on torrents, or sharing in any other manner a copy made from the thing you have, is not.

Saying that, training a model on someone's copyrighted text is not part of the agreement of the usage of said text whether it's a copyrighted magazine, newspaper, or book. If the people doing the training reach out to the copyright holders and get specific permission to use their copyrighted material in such a manner, then go ahead. The fact that people feel like they can do anything without the common courtesy of asking for permission is troubling to me that we've lost something as a society. There's no acknowledgment that someone has created something by their own work so that the creator can do with it as they please. A large portion of people believe that because it was created they deserve/should be able to/etc do whatever they want with someone else's creation. Including getting paid for derivative works from the original creation.

> The fact that people feel like they can do anything without the common courtesy of asking for permission is troubling to me that we've lost something as a society.

I see this sentiment a lot in FOSS spaces but I don't really understand why. The role of judicial process _isn't_ to provide a guiding moral philosophy around social organization. Depending on the government in question that's either a role of government functions or isn't something that should be guided at all. The role of law often (and yes, not in all governments, but at least in the US) is to offer a contract between the state and the individual.

I understand the potential for abuse here in using Copilot to regurgitate licensed works without adhering to the terms of the work's license, but I'm not fluent enough in law to know if this is illegal or not. Calling out and specifically applying strict limits to this practice is certainly something I'm sympathetic to, and I'm very curious to see what the courts come up with. But swayed by a moral argument I am not.

> People obviously do not like that, and argue making copies/backups is their right.

In some jurisdictions this is in fact their right by law as long as they own the original (the music/film industry of course used this as an excuse to slap additional fees on every sale of any storage medium). Redistribution is different however.

> My explanation will not be popular here on HN

How is this better than ’bring on the downvotes’?

Moving on, I’ll put this to you: you claim training a ML model against copyrighted text is in violation of the ‘permission’ granted by the rights holder. However, flip this on its head for a moment – that’s basically all human brains do. Clearly, the greatest writers of our time haven’t written their works in a vacuum. Rather, that historical reading and inspiration becomes sufficiently obfuscated that we deem something adequately creative enough to be granted its own copyright.

Fundamentally, how does Copilot differ, other than perhaps being a poor implementation? Is it by not being ‘adequately creative’ enough? Is there some future version you could envision that would be, or is it the principle you’re arguing against?

If a trained language model exactly reproduces copyrighted text, is there any question about whether copyright still applies?

But then the infringement is done by the person who publishes that output, not by the text editor that copies the code.

This is a useless hypothetical; no language models do that.

And yet there are plenty of examples of Copilot reproducing copyrighted code verbatim, like it does in this example[1] that was posted on HN.

[1] https://twitter.com/mitsuhiko/status/1410886329924194309

This is precisely what Copilot does, regularly.

The search engine on Github also calls up entire pages of GPL licensed code verbatim. Does it run afoul of copyright?

> Software licenses have barely been tested in court...

OSS licenses have been litigated and upheld. Can't supply details of my own experience for confidentiality reasons but plenty of plaintiffs have prevailed in suits about violations of OSS license terms. My guess is the numbers are higher than you might think because a lot of the cases end in non-public settlements.

A confidential settlement does not mean that a licence has been “tested in court” or “litigated and upheld.” It means the parties thought the risk of losing was high enough to justify a settlement. The state of the law remains uncertain because cases are getting settled rather than litigated.

Technically you are right, but the fact that defendants knew they were at high risk of losing means such licenses have teeth.

What about non-traditional-FOSS licenses? There is a lot of source-available not-OSI-compliant licensed software on GitHub like MongoDB, CockroachDB, etc., and that's clearly proprietary. If this thing is trained on that and generates what amount to snippets of that code then it's clearly violating those licenses.

Then there's private repositories. If they included those in the training data set that's even more actionable.

Personally I think this is software piracy at an absolutely unprecedented scale. Machine learning is just information transfer from the training data into weights in a model, a close relative of lossy data compression. Microsoft is now reselling all its GitHub users' code for profit.

Private repositories weren't included in the training data per GitHub, only public repos.

This really doesn't give me much comfort though. Making a repo public doesn't imply anything, it could be "All rights reserved".

> You're extremely overconfident about how this will actually play out.

I'd argue Microsoft, too, was/is overconfident about how this would play out. I would have expected a little more caution in selecting the training data.

> it doesn't mean every single line in isolation is copyrightable.

Copilot is known to reproduce entire blocks of text, including non-functional parts like comments.

While they are not tested, anything other than accepting the idea kills the idea of software completely. There is lots of room to change details, but somehow copyright and the fact that the code is copied into computer memory needs to be reconciled.

I don't see how. It might kill specific ideological licensing of software code, but the idea it'd kill software as a whole is pretty unbelievable. Software is too valuable to society.

As we're seeing, there's VERY little software where the specific algorithms or ideas in the software are what's valuable. The value comes from the ability to sell a service based on the software and operate it at scale. Like you said, how much SaaS is mostly open source stuff packaged up? Android is (sort of) open source, companies pay lots of people a lot of money to contribute to the Linux kernel where they give away the code they developed with that money, etc etc.

A software license, like any license, is a permission to operate.

> it doesn't mean every single line in isolation is copyrightable

It is if you can prove reproduction apart from your own original work (fair use). Unlike patents, copyright doesn’t protect uniqueness. It is only a shield from reproduction, and if reproduction is demonstrable to a court you are likely at risk.

https://cws.auburn.edu/OVPR/pm/tt/copyrightvplagiarism