by maep 1320 days ago
If this tool was trained on open source code, what license does the generated code have? At least with Codepilot, people were able to generate verbatim GPL code, typos and all. More importantly, I wonder whether the companies behind these types of tools offer legal or financial protection in case GPL code sneaks in and leads to expensive lawsuits.
3 comments

I mean, people are also trained on GPL code, and I bet you can find a ton of functions copied from GPL projects in a million other projects.

But as long as these are tiny parts of a codebase (which will most probably be the case), I doubt anything can be done about it. No one will go to court over a few generic functions.

No, they weren't able to generate the same existing code, both because that code is not stored anywhere in the model and because Copilot (not "Codepilot") has safeguards against this kind of situation, which could only arise in the highly unlikely case that a snippet is repeated thousands of times across thousands of repositories.

I've gotta let you know that people copy code snippets from all sorts of codebases with little regard for licenses anyway, because they're toothless in 99% of cases, AI or not. It's a nice illusion that anyone respects licenses, but it's just not true.

That's incorrect. Copilot steals verbatim. Examples: https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...
I've spent hours looking over code before delivering to FAANG. Our company had put a clause into the contract stating that our code was free of any GPL'd code. It had happened before and it was discovered. The whole thing was a very expensive exercise. I'm aware that many small startups, 90% of which go bust anyway, just ignore licenses, but that doesn't work when you play with the big boys.
If you look at licensed code, then write new code, do you also bring in those licenses?

It's been proved in court that AI does not infringe on copyright or licenses since it generates things from an understanding of the whole, instead of directly stealing, just like the human brain does.

That is going to need a source. All I see in these AI data-gathering exercises is that if the industry isn't a well-established litigious one, the companies will happily suck in all the data, license be damned. Code and art both fall under this. But when it comes to music, which is heavily litigated, suddenly the only content a company like Stability AI will use is open and voluntary, because in that case they worry about “overfitting and legal issues”. (See Harmonai.)

Hypocrisy dressed up as progress in the machine learning field has been one of the most embarrassing scenes in software engineering recently. The genie may be out of the bottle, but the fact is that a bunch of software engineers with a “move fast, ethics later” attitude are the ones who let it out, and they shouldn't get to shrug it off for free.

>It's been proved in court that AI does not infringe on copyright or licenses since it generates things from an understanding of the whole, instead of directly stealing, just like the human brain does.

Do you have a source for that?

This SF Conservancy article[0] says that's not true:

>Consider GitHub’s claim that “training ML systems on public data is fair use”. We have not found any case of note — at least in the USA — that truly contemplates that question.

The first major court case I know about is the class-action case Matthew Butterick is trying to build.[1]

[0] https://sfconservancy.org/blog/2022/feb/03/github-copilot-co...

[1] https://githubcopilotinvestigation.com/

Astonishingly enough, every sentence in this post is untrue. There's been no court case on any of the models in question here. They don't work like human brains, nor do they understand anything they output. Even if they did, that output would of course still be subject to licenses, given that human-written code is subject to them, which is why those licenses exist in the first place.

If you ever plan to steal someone's code and justify it with "my brain is able to learn, therefore copyright doesn't exist", I warn you right now: this will not fly.

US court rulings do not automatically apply worldwide, and not everything they would apply to exists within US jurisdiction.
> If you look at licensed code, then write new code, do you also bring in those licenses?

If the “new” code is close enough to be considered a derived work then you will need a license.

> If the “new” code is close enough to be considered a derived work then you will need a license.

And how is that determined? In court, at trial? By an unbiased third party competent enough to understand both codebases?

Same as for all programmers out there. Copilot will need to be careful and, like everyone else, it'll learn.