by ThrowawayR2 148 days ago
Yes, it's been discussed many times before. All the corporations training LLMs have to have done a legal analysis and concluded that it's defensible. Even one of the white papers commissioned by the FSF ("Copyright Implications of the Use of Code Repositories to Train a Machine Learning Model" at https://www.fsf.org/licensing/copilot/copyright-implications... ) concluded that using copyrighted data to train AI was plausibly legally defensible and outlined the potential argument. You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.
3 comments

> Even one of the white papers commissioned by the FSF

Quoting the text which the FSF put at the top of that page:

"This paper is published as part of our call for community whitepapers on Copilot. The papers contain opinions with which the FSF may or may not agree, and any views expressed by the authors do not necessarily represent the Free Software Foundation. They were selected because we thought they advanced the discussion of important questions, and did so clearly."

So, they asked the community to share thoughts on this topic, and they're publishing interesting viewpoints that clearly advance the discussion, whether or not they end up agreeing with them. I do acknowledge that they paid $500 for each paper they published, which gives some validity to your use of the verb "commissioned", but that's a separate question from whether the FSF agrees with the conclusions. They certainly didn't choose a specific author or set of authors to write a paper on a specific topic before the paper was written, which a commission usually involves, and even then the commissioning organization doesn't always agree with the paper's conclusion unless the commission isn't considered done until the paper is updated to match the desired conclusion.

> You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

This would be consistent with them agreeing with this paper's conclusion, sure. But that's not the only possibility it's consistent with.

It could alternatively be because they discovered or reasonably should have discovered the copyright infringement less than three years ago, therefore still have time remaining in their statute of limitations, and are taking their time to make sure they file the best possible legal complaint in the most favorable available venue.

Or it could simply be because they don't think they can afford the legal and PR fight that would likely result.

Since I very specifically wrote "commissioned by the FSF" instead of "represents the opinion of the FSF" to avoid misrepresenting the paper, you're arguing against something I have not said.
True, I was only arguing against something that you seemed to me to have implied, not anything you outright said.
> Even one of the white papers commissioned by the FSF [...] concluded that using copyrighted data to train AI was plausibly legally defensible [...] notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

I agree with jkaplowitz, but I think your description is a bit misleading for a different reason. The FSF-commissioned paper argues that Microsoft's use of code FROM GITHUB, FOR COPILOT is likely non-infringing because of the additional GitHub ToS. That feels like critical context to provide, given that in the very next statement you widened it to LLMs generally, and to the FSF, which likely cares about code that is not on GitHub as well.

All of that said, I'm not sure it matters. I don't find that argument from the whitepaper very compelling, because it depends critically on additional grants in the ToS. IIRC (going only from memory), the ToS requires that you grant GitHub a license as needed to provide the service. GitHub can provide the services the user reasonably understood GitHub to provide without violating the additional clauses specified in the existing FOSS license covering the code. That was the situation a while ago, though, and I'd say it's very murky now, because everyone knows Microsoft provides Copilot, so "obviously" they need it.

Unfortunately, and importantly when dealing with copyright, the paper also covers the transformative fair use arguments in depth, and I do find those arguments very compelling. The paper (and likely others) argues that the code output from an LLM is likely transformative, and thus can't be infringing (or is unlikely to be). I think in many cases the output is clearly transformative in nature.

I've also seen code generated by Claude (likely others as well?) copy large sections from existing works, where it's clearly "copy/paste", which can't be fair use or transformative. The output clearly copies the soul of the work. Given that I have no idea what dataset they're copying this code from, it's scary enough to make me unwilling to take the chance on any of it.

So it's legal to train an "intelligence" on everything for free based on fair use, but it's not legal to train another intelligence (my brain) on it?
No, it's also not illegal to train your brain. If you break into a store, and read all the books, you'll get arrested for breaking and entering. Not for reading the books. My (superficial) take on the argument is that they're hoping by saying "it's not illegal to read" no one will notice, and no one will ask how they got into the book store to begin with.
So why is it illegal to download a pirated copy of a book from the internet to "train" my brain? There's no breaking and entering there, right?
The answer is in the name of the law: copyright, the right to produce a copy. The original, ethical intent behind the law was to encourage people to create things. Someone could invest time and money into creating some art that had value, and then they were given the exclusive right to monetize it for some amount of time. You could create something, and I'm not allowed to copy what you created and sell it without your permission, which stops me from doing no work while capturing all the money you could reasonably make off your work.

Want to create a song? You're the only person allowed to duplicate it, or to authorize others to duplicate it. You're the only person allowed to control the supply of your effort. Eventually, the public good and the public interest were supposed to take over, because in the end, you're right, it's just information. It was supposed to enter "the public domain" where anyone could freely use it. But then Disney got involved, and now it's a toxified weapon used mostly by unethical lawyers against curiosity.

Because you are making a copy? Moreover, in some jurisdictions only uploading is illegal. Downloading is fine.
You're close to an important point.

Our current laws are written to make it legal for you to copy the Quran via your brain — some people learn it by rote and can stand up and speak the entire work from one end to the other. This is intended to be legal. Fair use of the Quran.

I went to a concert recently where someone copied every word and (as far as I could hear) every note from a copyrighted work by Bruce Springsteen. Singing and playing. This too is intended to be fair use.

You can learn how to play and sing Springsteen songs verbatim, and you can use his records to learn to sound like him when you sing, and that's intended to be legal.

Since the law doesn't say "but you cannot write a program to do these things, or run such a program once written", why would it be illegal to do the same thing using some code?

The people who want the law to differentiate have a difficult challenge in front of them. As I see it, they need to distinguish what humans do to learn from what machines do, and that implies really knowing what humans do. And then they need to draw boundaries, making various kinds of computer-assisted human learning either legal or illegal.

Some of them say things like "when an AI draws Calvin and Hobbes in the style of Breughel, it obviously has copied paintings by Breughel" but a court will ask why that's obvious. Is it really obvious that the way it does that drawing necessarily involves copying, when you as a human can do the same thing without copying?

> I went to a concert recently where someone copied every word and (as far as I could hear) every note from a copyrighted work by Bruce Springsteen. Singing and playing. This too is intended to be fair use.

Only the learning part is fair use. Playing an artist's songs in public does not violate the copyright of the original performing artist, but it does violate the songwriters' copyright, and you do need a license to play covers in public.

They're called Performing Rights: https://en.wikipedia.org/wiki/Performing_rights

It can also violate other laws and rules that are not relevant to copyright. Perhaps I should have digressed into listing that? I chose not to.
Performing rights are part of copyright law and thus directly relevant to copyright. Stop dissembling.
What? I didn't know that. Do you have a reference? I'm particularly interested in the origin: is this something that applies to countries with a common law tradition, a Roman law tradition, does it originate in one of the copyright treaties, etc. That kind of question.