Hacker News
by chartreusek 1819 days ago
Sure, but there's still the license at play here. It's not like they trained it only on public domain/CC0 code. What happens when it verbatim outputs a significant amount of code that was originally MIT, or BSD, or GPL licensed without the appropriate attribution? It can create unintended copyright violations and potentially open people using it up to liability.

>What happens when it verbatim outputs a significant amount of code that was originally MIT, or BSD, or GPL licensed without the appropriate attribution

You would sue.

And then GitHub would argue that their algorithms did not spit out the code verbatim by copying it, but rather generated code that looked exactly like the other code based on learning from millions of codebases.

And then there would be lots of lawyers.

And then a judge would have to decide.

And the judge would really not care about the "we did not copy it, we made an algorithm that created the exact code" technicality. It's their job to see through such things and consider the case at a higher level.

So the judge would look at two pages of exactly the same code and then decide whether the "not really copied" part is substantial enough to count as a copy of an original work. If it is, it is a copyright violation. Nobody cares that you used an algorithm in between: you took the original as an input and ended up with exactly the same thing as an output. Copyright violation, case closed...

But it would still potentially have used code not licensed for commercial use as a data set for a commercial product, which is problematic.

GitHub really needs to clarify which code was allowed for inclusion here. Until then we can only speculate and enumerate potential scenarios.

I suppose it really depends on whether they spit out verbatim reproductions of code or whether it is the equivalent of a programmer with 10 years of experience who has seen a lot of code but isn't reproducing anything verbatim.

We shall see, by Googling some of the code it spits out.
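
For anyone who would rather check locally than Google, here is a rough sketch of that kind of test. It assumes you already have a generated snippet and a candidate original as strings, and it only finds the longest run of identical whitespace-separated tokens. The function name `longest_verbatim_run` is made up for illustration, and nothing here says how long a run has to be before it matters legally.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(generated: str, reference: str) -> str:
    """Return the longest run of tokens that appears verbatim in both texts."""
    a = generated.split()
    b = reference.split()
    # autojunk=False so frequent tokens such as ";" or "=" are not ignored.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


if __name__ == "__main__":
    generated = "x = 1 ; i = 0x5f3759df - ( i >> 1 ) ; return x"
    reference = "y = number ; i = 0x5f3759df - ( i >> 1 ) ; y = 2"
    print(longest_verbatim_run(generated, reference))
```

Splitting on whitespace is a deliberate simplification: it ignores reformatting and renamed variables, so it would only catch the most literal kind of copying.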

FWIW GPT-3 doesn't really tend to spit out verbatim reproductions of copyrighted books.

Copilot spitting out fast inverse square root, verbatim: https://twitter.com/mitsuhiko/status/1410886329924194309

HN discussion: https://news.ycombinator.com/item?id=27710287

I have worked a bit with transformers, the model underlying GPT. They absolutely learn to copy training data, and that’s perfectly normal.

What is happening here is we’re running into exactly what modern ML is NOT capable of: deductive reasoning. It does not think “I need to query the Twitter API for some posts, then filter them. Right, the API works like this…” No. It doesn’t think at all. It is a regression machine. “This sequence begins/looks like something I have seen before, here’s the corresponding output modulo adaptations.”

ML does not self-reflect, question motives and analyse causes. It’s just a complete lie to suggest otherwise, and to call this “pair programming”? What an absolute joke. It’s a lot like Tesla calling its glorified lane keeping an autopilot.

> FWIW GPT-3 doesn't really tend to spit out verbatim reproductions of copyrighted books.

But it does spit out whole paragraphs at a time. This is easy to test by going to any of the GPT-2/3 playgrounds on-line (e.g. AI Dungeon), and playing with prompts. Very specific prompts work best, but sometimes even with a generic prompt, if you let the model continue on its own past the first output, it might just shunt itself into a path where following the most probable continuations happens to reproduce a substantial portion of some work verbatim.
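
A toy sketch of why this happens, under heavy assumptions: this is a word-level n-gram counter, not a transformer, and the names `train` and `continue_greedy` are made up for illustration. It only shows the mechanism described above: when every context in the prompt's neighbourhood was seen in exactly one training document, always picking the most probable next token walks straight through that document.

```python
from collections import Counter, defaultdict


def train(documents, order=2):
    """Count which token follows each context of `order` tokens."""
    table = defaultdict(Counter)
    for doc in documents:
        tokens = doc.split()
        for i in range(len(tokens) - order):
            context = tuple(tokens[i:i + order])
            table[context][tokens[i + order]] += 1
    return table


def continue_greedy(table, prompt, order=2, max_new=50):
    """Extend the prompt by always taking the most frequent next token."""
    out = list(prompt)
    for _ in range(max_new):
        context = tuple(out[-order:])
        if context not in table:
            break  # never saw this context in training, so stop
        out.append(table[context].most_common(1)[0][0])
    return out


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "pack my box with five dozen liquor jugs",
    ]
    table = train(docs)
    # A two-word prompt is enough to recover the whole first document.
    print(" ".join(continue_greedy(table, ["the", "quick"])))
```

A real model generalizes far more than this, but the failure mode is the same shape: the more specific the prompt, the fewer training examples compete for the next token.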

Maybe they should train the ML to read the license? If the ML can understand the license, then we'll have to bow down to their superiority. However, if it did understand the license, then it would do the right thing.

So sue them and a court opinion can demonstrate where the line is and how much code can be replicated before attribution is required (and the product can be refined to ensure compliance).

Innovation should push boundaries.

They could push boundaries and publish one trained on all of Microsoft's internal source code. For me, that would be a great demonstration that they believe the "it's fair use and not violating copyright on the training data" argument.

It's more likely they'd sue someone who used it to develop something that ate into their lunch by saying it infringed on one of their 'secret' Linux patents they sabre rattle about every now and then.

Are you equipped to fight a protracted legal battle with Microsoft? Neither is anyone else.

The product is already dead. It's not just Microsoft that would be violating the license, but any company using the application, and Microsoft can't shield them.