by SurgeArrest 1316 days ago
I hope this case will fail and establish a good precedent for all future AI litigation, and maybe even prevent new cases. Your code is open source - irregardless of license, one might read it as a text book and then remember or even copy snippets and re-use this somewhere else unrelated to the original application. If you don't like this, don't make your code open source. This was happening and is happening, independent of any license, all over the world, by the majority of developers. What Copilot and similar tools did was to make those snippets accessible for extrapolation in new applications.

If these folks win - we again throw progress under the bus.

7 comments

No thank you. I put a license to be followed, not to just be disregarded by an AI as "Learning material". No human perfectly reproduces their learning material no matter what, but Copilot does.
You mean to tell me that no one has ever perfectly replicated an example that they read somewhere? There are only so many ways to write AABB collision, Fibonacci, or any number of other common algorithms. I'm not saying there aren't things to consider, but I'm sure I've perfectly replicated something I read somewhere, whether I'm actively aware of it or not.
So are you ok with it being illegal for humans to learn from copyrighted books unless they have a license that explicitly allows learning? That does not sound like a pleasant consequence.
Would you use an AI text generator to write a thesis? No, there's a risk a whole chunk of it will be considered plagiarism because you have no idea what the source of the AI output is, but you know it was trained with unknown copyrighted material. This has nothing to do with the way humans learn, it's about correct attribution.

There is no technical reason why Microsoft can't respect licenses with Copilot. But that would mean more work and less training input, so they do code laundering and excuse it with comparisons to human learning because making AI seem more advanced than it is has always worked well in marketing.

Edit: And where do you draw the line between "learning" and copying? I can train a network to exactly reproduce licensed code (or books, or movies) just like a human can memorize it given enough time - and both of those would be considered a copyright violation if used without correct attribution. If you train an AI model with copyrighted data you will get copyrighted results with random variation, which might be enough to make them unrecognizable if you're lucky.

> Would you use an AI text generator to write a thesis? No, there's a risk a whole chunk of it will be considered plagiarism because you have no idea what the source of the AI output is, but you know it was trained with unknown copyrighted material.

Of course, but that's a separate issue. We're not talking about whether the output of the AI is copyrighted. We're talking about whether it's ok for it to learn from copyrighted material.

Again you can say exactly the same about humans. I am perfectly capable of plagiarising or outputting copyrighted material. That doesn't mean it's illegal to learn from that material, just to output it verbatim.

So the fundamental issue is that it's harder to tell when an AI is plagiarising than it is when you produce something yourself. But that is a technical (and probably solvable) issue, not a legal one. And it's not the subject of this lawsuit.

Here's the thing - the US has well-established laws around copyright that don't consider learning from books a violation of those copyrights. This lawsuit is intended to challenge Copilot as a violation of licensing and isn't a litigation of "how people learn." Your program stole my code in violation of my license - there's a clear legal issue here.

I'd pose a question to you - would it be okay for me to copy/paste your code verbatim into my paid product in violation of your license and claim that I'm just using it for "learning"?

If you cherry-picked sections of my code? I'd have no more issue with it than George R.R. Martin would if you grabbed a few paragraphs out of one of his fantasy books and used them in your novel.
I think they're taking issue with the unauthorized duplication of copyrighted code. That's distinct from learning how to code (which I don't think anyone would claim Copilot is doing) which people get from reading a book. If you were to read the book only to copy it verbatim and resell it, you're going to have a bad time.
It's a pleasant consequence for the person who spent years becoming an expert and then writing the book. It's also a pleasant consequence for the people who buy the book, which might not have existed without a copyright system to protect the writer's interests.
AIs are not humans; no human can read _all_ the code on GitHub. Humans certainly can't read _all_ the code on GitHub at the scale that MS can, and are unlikely to be able to extract profits directly from that code, in violation of the licensing.
I doubt it, but they'd probably be against people quoting copyrighted material verbatim without attribution in their own work after.
100% false. There are loads of historical cases of people with eidetic memories being able to reproduce things that they've seen with near complete fidelity, and there's no reason to believe that a coder with such a memory would be any different.
> Your code is open source - irregardless of license, one might read it as a text book and then remember or even copy snippets and re-use this somewhere else unrelated to the original application.

Yes, but attribution should still be given. Just because you don't copy-paste someone else's creation doesn't mean you're licensed to use it.

Is it the role of the tool (in this case Copilot) to include the license information? Or is it the responsibility of the organization using the code to make sure that it wasn't copied from somewhere?

What if, instead of a tool, you had a random consultant do some work, and it was found out that he asked a ton of stuff on Stack Overflow and copied the CC-BY-SA 4.0 answers into his work? What if it was then found out that one of those answers was based on copying something from the Linux kernel? Who is responsible for doing the license check on the code before releasing the product?

> Or is it the responsibility of the organization using the code to make sure that it wasn't copied from somewhere?

Do you know whether the code you got from Copilot has an incompatible license? No, so if you plan to use Copilot for serious projects you need it to include sources/licenses either way. In fact that would be a very helpful feature as it would let you filter licenses.

> irregardless of license

Hard no. Please stop using open source code if this is how you think of it.

Without licenses being respected, we don't get open source communities.

Licenses be damned, copyright law sits above them -- and for now, it's hard to see how this isn't fair use. The only case might be an open source Copilot alternative, and GitHub and OpenAI can take any such projects out of the training set.
Open source does not mean public domain. Open source specifically attaches limitations on how the code may be reused.
There are no limitations on reading the code to learn from it.
Perhaps the lawsuit contends that Copilot isn't in fact learning how to code, but is rather regurgitating information it has managed to glean and statistically categorize, without any real understanding as to what it was doing?
> Your code is open source ....

So why can MS screw with only the licenses that you call "open source"? Your example with a human reading a book would also work with source-available licenses or decompiled binaries.

I would have been fine if the open source code had been used to create an open model, or if MS had put its ass on the line and also trained the model on all the GitHub code, since they claim there is no copyright issue.

The problem is that copyright laws were introduced for a reason, and with thinking similar to yours we might decide to get rid of copyright altogether, which I think is a bad idea.

P.S. I am not a lawyer.

If organisations are going to ignore the licenses attached to my OSS and that's legitimised in the law, then that's a surefire way to irreparably damage the open source ecosystem.