by JoshTriplett 1217 days ago
Sadly, "simple license management" here just refers to "who in your organization has a license to use this tool", rather than "where did this code come from and what license is it under".

This tool remains the equivalent of money laundering for violation of Open Source licenses (or software licenses in general).

Where's the part of the Copilot EULA that indemnifies users against copyright infringement for the generated code?

If the model was trained entirely using code that Microsoft has copyright over (for example: the MS Windows codebase, the MS Office codebase, etc.) then they could offer legal assurances that they have usage rights to the generated code as derivative works.

Without such assurances, how do you know that the generated code is not subject to copyright and what license the generated code is under?

Are you comfortable risking your company's IP by unknowingly using AGPLv3-licensed code [1] that Copilot "generated" in your company's products?

[1] https://en.wikipedia.org/wiki/GNU_Affero_General_Public_Lice...

Ultimately I think MS is right to bet that this will not matter in the long run. All AI tools are trained on copyrighted data. The tools are too useful to cripple with restrictive laws written for a previous era.
The pirate movement might have been a bit early for their time, but if AI happens to be the technological advancement needed for society to abolish copyright law, then let's get it done.

With everything from scihub getting blocked and people getting subscription fatigue, it is an excellent time for the restrictive laws to be replaced.

> The pirate movement might have been a bit early for their time, but if AI happens to be the technological advancement needed for society to abolish copyright law, then let's get it done.

No argument there; that'd be a huge victory.

But until that happens, tools for "smart completions" need to provide appropriate licensing and attribution metadata.

Please by all means have those laws repealed. Until then, a license violation remains a license violation no matter how much AI you run the code through.
Napster was right to bet on music being accessible online but oh wait
It’s not AI if it’s just copying and pasting someone else’s code
>> All AI tools are trained on copyrighted data. The tools are too useful to cripple with restrictive laws written for a previous era.

Are you sure that Disney and Games Workshop will not mind if you make your own Mickey Mouse / Warhammer 40K cross-over movie for profit? (I would watch it!)

AI can help generate it, so it must be okay.

Those are protected by trademark law which will still be strong in a post AI world. I'm sure those companies will be making massive use of AI to speed up content production.
You previously stated "The tools are too useful to cripple with restrictive laws written for a previous era." and also "Those are protected by trademark law which will still be strong in a post AI world."

So do you believe in legal protections for trademarked and copyrighted works or not?

If an AI assistant generates content that includes copyrighted or trademarked elements, is the content "too useful to be crippled by restrictive laws" or "protected by laws which will still be strong"?

I agree. AI is happening. Don’t swim upstream.
Copilot for Business includes a $500k indemnity if you turn on excluding suggestions that match open source code https://github.com/customer-terms/github-copilot-product-spe...
Good to know, but there are caveats:

"GitHub’s defense obligations do not apply if (i) the claim is based on Code that differs from a Suggestion provided by GitHub Copilot, (ii) you fail to follow reasonable software development review practices designed to prevent the intentional or inadvertent use of Code in a way that may violate the intellectual property or other rights of a third party, or (iii) you have not enabled all filtering features available in GitHub Copilot."

Do they define what constitutes "reasonable software development review practices designed to prevent the intentional or inadvertent use of Code in a way that may violate the intellectual property or other rights of a third party"?

>(iii) you have not enabled all filtering features available in GitHub Copilot

I was trying to find out what these exact filtering settings are but I land at this page [1].

The other links just download the Product Specific terms PDF again.

Not sure if this is an oversight or intentionally circular.

[1] https://docs.github.com/en/copilot/configuring-github-copilo...

I wonder if the procurement departments get to negotiate that higher for the bigger contracts, because that's low imo.
This seems like it covers non-business Copilot as well.
>Are you comfortable risking your company's IP by unknowingly using AGPLv3-licensed code that Copilot "generated" in your company's products?

This would not risk your company's IP.

>> This would not risk your company's IP.

"The GNU Affero General Public License is a modified version of the ordinary GNU GPL version 3. It has one added requirement: if you run a modified program on a server and let other users communicate with it there, your server must also allow them to download the source code corresponding to the modified version running there."

So your company is okay providing all their source code to users?

I know that some companies do this, but most do not.

>So your company is okay providing all their source code to users?

Nothing is forcing them to do that. If they are infringing they simply need to delete the infringing code and rewrite it by hand.

That's not how copyright infringement works. Great, you've stopped committing new infringements, but there's still a legal case over the previous infringements, for which you still need to provide appropriate remedy.

If the remedy for copyright infringement were just "oh, we got caught, guess we'll stop now", that would provide substantial incentive for people to violate licenses as long as they hoped they wouldn't get caught. The remedy for such violations needs to be substantial enough that it's not profitable to temporarily get away with.

The company would have to pay a fee to the copyright owner plus some extra.

I am jaded from practice around GDPR where the thinking of companies goes along the line: if we are caught, we will pay extra, but now we make big cash. And who knows, maybe we won't get caught.

>> Nothing is forcing them to do that.

If you use AGPLv3-licensed code in your codebase, you are agreeing to the terms of the license.

Practically all corporate legal teams for companies that create software will strictly prohibit the use of AGPLv3-licensed software, if not all GPL-based licenses.

Using AGPL code doesn't mean you agree to the terms. It means if you don't agree to and obey the terms, you don't have a license, which is copyright infringement.
It's an unpopular opinion which is why I'll cowardly write it under a throwaway account. Josh, I have a ton of respect for your work just btw. I can't help but see a headline like this and think "Okay the license argument has to be the top comment already".

To me this whole thing is like Pandora's box, and it will not in any way be put back into the box. In the long run, isn't arguing about the code it generates and how it generates it mostly tilting at windmills? I've already met new / junior programmers that have used Copilot and ChatGPT to help them see how to approach certain problems or to get better framing for what they couldn't quite put into accurate enough words to google.

I too would prefer these tools embody the ideal: no license violation, perfect citation of where the archetypes of the code came from. I've commented here today (amongst some great FOSS software engineers) to see if a genuine respectful conversation can be had about how, just like torrents, this one isn't going to be put back in the box no matter how many legal precedents attempt to cut off (or succeed in cutting off) heads of the hydra. Its utility seems like it will steamroll any attempts to stop or slow it down.

Am I wrong? Is it a fool's errand to ask?

> Its utility seems like it will steamroll any attempts to stop or slow it down.

What? I don't see any utility outside of education and even there it's pretty sketchy.

For business, legal compliance is not a joke and instantly shuts it down. The only businesses willing to use ChatGPT for generating code would be naive young startups who don't realize some assembly is still required and the instructions are missing no matter how much they query the bot. That's called expertise (which they don't yet have). It's not good enough to just write the code. Someone has to comprehend it so they can tweak it as needed. At some point the tweaks will become unwieldy and require actual software engineering that the bot doesn't know how to do (transform from one design pattern to another and know which to use). More power to them if they can cobble something together and then succeed at maintaining it. By the time they're through they'll have pulled off so many miracles that they won't need the bot anymore and become experts. That's quite the trial by fire, but hey everyone has to find their way!

I'm not saying "put it back in the box", I'm saying fix it to actually track Open Source licenses and provide attributions.

I'd have no objections to a tool that generated suggestions that came with attributions and license metadata, ready to insert into your project's file for third-party licenses. AI code suggestions are impressive.

I have objections to a tool that generates derived works from code without respecting the licenses of that code. For permissively licensed Open Source code, including that code without attribution deprives authors of their due credit (said credit often being how people get employment or funding). For copyleft Open Source code, including that code without using a compatible license violates the conditions upon which people made that code available for others to build upon and share. For proprietary code, including that code at all incurs legal risks.

I can understand if people don't agree that it's copyright infringement, and will use these tools on that basis. However, rolling over on issues you're passionate about because they're difficult to address? Well, if everyone was like that, nothing would ever change.
Agreed. Copyright is not a fundamental law of physics; it's something we invented to help incentivize creation. The moment AI tools are shown to spur creation, they become more useful than copyright, so society will simply rewrite the laws to adapt.
As an individual, do you think you'll come out the other end of this with more freedom than under current copyright restrictions? And when you're up against bigger players than yourself?
OpenAI/GPT products at MS are making the exact same bet.
Yeah we were told not to use it by the lawyers at work and have an official policy against using it. Not having that would open us up for liability if we’re sued as there’s no defence that what we did was clean room if we admitted using it.

We’ll hang back until other companies have litigated their way to some legislation around it.

Same here. Told to steer clear of this (in fairness, not only Copilot, also ChatGPT and stuff), until somebody else pipecleans this in the courts.

No matter what is said, there are no license guarantees on the generated code, as you don’t know the exact provenance, so it seems only sensible to be on the safe side.

The lawyers at my last job were terrified to learn that we stored tracking information on website visitors' computers that third-party corporations used to track them.
Yes, I can imagine. It’s all fun and games until you get something that sticks in the mainstream media/press, for example Cambridge Analytica sort of stuff.
I don't understand why they aren't tagging data with license information and allowing users to use models that don't include certain licenses - seems like it would be the middle ground given the stance they've taken; like, "we don't think it's a problem, but if this makes you feel better you can use these other models that specifically don't train on GPL code, or whatever"

I would prefer to see full license attributions included in generated responses, though. Something that then also wouldn't be that difficult to generate a licenses file from?

Amazon's CodeWhisperer has a "reference tracker" that tells you the license of training data code if the generated response is within some similarity threshold, but that's still not good enough imo.

> I would prefer to see full license attributions included in generated responses, though. Something that then also wouldn't be that difficult to generate a licenses file from?

Exactly. By all means build tools like this, but build them to actually comply with Open Source licenses. Provide a list of the licenses you don't mind copying from, and get back attributions with your suggestions.
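
As a rough sketch of what that could look like, assuming suggestions carried provenance metadata (nothing in this thread says Copilot exposes any): the `Suggestion` type, its fields, and the sample data below are all hypothetical and only illustrate the idea of filtering against an allow-list of SPDX license identifiers and collecting entries for a third-party licenses file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical suggestion carrying provenance metadata (not a real API)."""
    code: str
    source_repo: str       # where the code was derived from (assumed to be known)
    license_id: str        # SPDX identifier, e.g. "MIT"
    copyright_notice: str


def filter_by_license(suggestions, allowed):
    """Keep only suggestions whose license is on the caller's allow-list."""
    return [s for s in suggestions if s.license_id in allowed]


def third_party_notices(suggestions):
    """Build de-duplicated attribution lines for a third-party licenses file."""
    seen = set()
    lines = []
    for s in suggestions:
        key = (s.source_repo, s.license_id, s.copyright_notice)
        if key in seen:
            continue  # same source already credited
        seen.add(key)
        lines.append(f"{s.source_repo} ({s.license_id}): {s.copyright_notice}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    candidates = [
        Suggestion("def a(): ...", "example/lib-a", "MIT", "Copyright (c) Alice"),
        Suggestion("def b(): ...", "example/lib-b", "AGPL-3.0-only", "Copyright (c) Bob"),
    ]
    accepted = filter_by_license(candidates, {"MIT", "Apache-2.0", "BSD-3-Clause"})
    print(third_party_notices(accepted))
```

A real notices file would also have to reproduce the license text itself where the license requires it (MIT, for instance, requires both the copyright notice and the permission notice), so this only covers the bookkeeping part.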

Suppose Copilot offered some pure-MIT licensed flavor.

Copilot could comply with MIT licenses by just outputting an MIT license with ALL the authors of code used in training.

That'd be a valid solution, if impractical. I doubt that people would be willing to copy hundreds of thousands of license notices into their project.
One perspective is that those authors actually contributed to the end result.

But sure, disk size could be a problem.

> One perspective is that those authors actually contributed to the end result.

They absolutely did, yes. The approach you're suggesting would work from a legal perspective, but the size might pose practical problems.

> Amazon's CodeWhisperer has a "reference tracker" that tells you the license of training data code if the generated response is within some similarity threshold, but that's still not good enough imo.

I don't think it's possible to do better than that with this technology.

like I probably don't understand this in the right way, but I could have sworn we had the ability to probe latent space on models like these and make mappings based on them? Or was that only for diffusers?
Sure, you could build an index of attribution <> latent space coords, but it would not be clear whether a generated document near several index entries would require compliance.

I guess this is where the threshold comes from. Choose a generous margin and over-attribute rather than under-attribute.
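
A minimal sketch of that over-attribution idea, assuming you already have some embedding for the generated snippet and an index mapping sources to embeddings. How you would get those from a real model is exactly the open question here; the `index` contents, the source labels, and the threshold values are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attributions_for(embedding, index, threshold=0.8):
    """Return every indexed source at least `threshold` similar, closest first.

    A lower threshold means a more generous margin: more sources get
    credited, i.e. over-attribution rather than under-attribution.
    """
    hits = []
    for source, vec in index.items():
        score = cosine_similarity(embedding, vec)
        if score >= threshold:
            hits.append((score, source))
    hits.sort(reverse=True)
    return [source for _, source in hits]


if __name__ == "__main__":
    # Hypothetical index: attribution label -> embedding of that source.
    index = {
        "repo-a (MIT)": [1.0, 0.0],
        "repo-b (GPL-3.0-only)": [0.6, 0.8],
        "repo-c (Apache-2.0)": [0.0, 1.0],
    }
    generated = [1.0, 0.0]
    print(attributions_for(generated, index, threshold=0.8))
    print(attributions_for(generated, index, threshold=0.5))
```

This doesn't settle whether a snippet near several entries legally requires compliance with all of them; it only shows that the margin is a knob you can turn toward crediting more sources.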

You make it sound like Copilot is just copy-pasting something from a single repo. The code Copilot generates for me is extremely specific to my application. It understands the context, my code style and what I'm trying to do.

The result looks like my own code and is utilizing the already existing parts of my application. The code it writes for me solves problems that you cannot find a standard solution for anywhere and is definitely not something that could be attributed.

How Copilot is trained is an issue but answering the question "where did this code come from and what license is it under" would be impossible.

> You make it sound like Copilot is just copy-pasting something from a single repo.

Not at all. I'm saying it's derived from large amounts of code, without respecting the licenses on that code.

> How Copilot is trained is an issue but answering the question "where did this code come from and what license is it under" would be impossible.

Then it shouldn't exist outside of demos of what could exist in the future if the showstopper legal problem gets solved. Let's get people treating that constraint as business-critical and start coming up with clever solutions, and see how long "impossible" lasts.

How is training on public source code a legal problem? Can you provide some links for that claim?
What a bunch of pure fear mongering nonsense. The code that is produced by Copilot is indistinguishable from the rest of the code in the code base. Get over yourself.
I'd like to see people try to cite the sources for the code they wrote. It's highly improbable that without looking at anyone else's work in their lives, they would have created anything remotely similar to what they produced.
I open source my code specifically so that it can be re-used, ideally even for cases like this. To me, software freedom is the ability for it to be used for effectively any useful purpose, so that others don't have to do the same work again.

I understand that some folks don't believe the same thing, and use copyleft licenses so that their code can't be re-used in a closed way, and that's fair. Github shouldn't be training their product on copyleft licenses.

It's fair to call out its misuse of certain licenses, but "the equivalent of money laundering for violation of Open Source licenses" is simply inaccurate, as many licenses allow this type of re-use explicitly.

Do you license your code under a license that doesn't require any form of attribution or preservation of copyright notices? (Even permissive Open Source licenses typically do require that.)

If you do, then by all means they're welcome to use it without attribution or preservation of copyright notices, per the terms of the license you used.

But for all the Open Source code, even permissive Open Source code, that does require attribution or preservation of copyright notices, that's still a license violation. People don't often think of permissive Open Source licenses as something that can be violated, but they absolutely can be.

Indeed, that is why I don't use it either.

Double-checking whether the generated part is a verbatim copy negates the speed advantage.

Possible infringements from similarity are even harder to search.
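
For the verbatim case, a crude check could be automated. The sketch below is only an illustration (the tokenizer regex, the n-gram size, and the idea of having the known code at hand to compare against are all assumptions); it measures how many token n-grams of a generated snippet also appear in a known source, ignoring whitespace. It says nothing about the harder "similar but not identical" case.

```python
import re


def token_ngrams(code, n=5):
    """Set of n-token windows over a rough tokenization that ignores whitespace."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(generated, known, n=5):
    """Fraction of the generated snippet's n-grams that also occur in `known`."""
    gen = token_ngrams(generated, n)
    if not gen:
        return 0.0  # snippet shorter than n tokens: nothing to compare
    return len(gen & token_ngrams(known, n)) / len(gen)


if __name__ == "__main__":
    known = "def add(a, b):\n    return a + b\n"
    print(overlap_ratio("def add(a,b): return a+b", known))
    print(overlap_ratio("print('hello world')", known))
```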

Yes, thank you this is exactly what I was looking out for in the announcement.

Was looking for a way to instruct Copilot to abide by the following rules:

- Only use Apache v2, MIT or BSD licensed work for its recommendations. (Or a specific license set)

- Only use code trained on public repositories.

- Provide code attributions of the source code where the recommendations originate from.

I'm not sure if the last point is possible given these GPT type architectures but it would really help during code reviews.

> Only use Apache v2, MIT or BSD licensed

Even if you use code under these licenses, you are still supposed to credit the authors by reproducing the license. So you need to know where it came from. Do you credit all the software the model was trained on?

> Do you credit all the software the model was trained on?

That's a brilliant idea: everyone can just copy and paste the exact same attribution file and be done with it.

"Just" copy paste a multi GB file with endless lines of authors? Brilliant? I might be missing something here.
We're not using any ML-based, machine-generated source code.

The remaining use of OSS code has attributions.

> This tool remains the equivalent of money laundering for violation of Open Source licenses

That's what a good chunk of people do anyway at work. No one really cares, nor will they. We were already moving in that direction anyway; this will just accelerate it.