by senorqa 294 days ago
> The copyright status of LLM-generated code is of concern to many developers; if LLM-generated code ends up being subject to somebody's copyright claim, accepting it into the kernel could set the project up for a future SCO-lawsuit scenario.

Ain't that anticipatory obedience?


Yes, but twofold.

There is no reason why I can't sue every single developer to ever use an LLM and publish and/or distribute that code for AGPLv3 violations. They cannot prove to the court that their model did not use AGPLv3 code, as they did not make the model. I can also, independently, sue the creator of the model, for any model that was made outside of China.

No wonder the model makers don't want to disclose who they pirated content from.

Isn't it up to you to prove the model used AGPLv3 code, rather than for them to prove they didn't?

Not inherently.

If their model reproduces enough of an AGPLv3 codebase near verbatim, and it cannot be simply handwaved away as a phonebook situation, then it is a foregone conclusion that they either ingested the codebase directly, or did so through somebody or something that did (which also dooms models trained on purely synthetic data, like Phi).

I imagine a lot of lawyers are salivating over the chance of bankrupting big tech.

The onus is on you to prove that the code was reproduced and is used by the entity you're claiming violated copyright. Otherwise literally all tools capable of reproduction — printing presses, tape recorders, microphones, cameras, etc — would pose existential copyright risks for everyone who owns one. The tool having the capacity for reproduction doesn't mean you can blindly sue everyone who uses it: you have to show they actually violated copyright law. If the code it generated wasn't a reproduction of the code you have the IP rights for, you don't have a case.

TL;DR: you have not discovered an infinite money glitch in the legal system.

Yes! All of those things DO pose existential copyright risks if people use them to violate copyright! We're both on the same page.

If you have a VHS deck, copy a VHS tape, then start handing out copies of it, I pick up a copy of it from you, and then see, lo and behold, it contains my copyrighted work, I have sufficient proof to sue you and most likely win.

If you train an LLM on pirated works, then start handing out copies of that LLM, I pick up a copy of it, and ask it to reproduce my work, and it can do so, even partially, I have sufficient proof to sue you and most likely win.

Technically, even asking "which license" is a bit moot: AGPLv3 or not, it's a copyright violation to reproduce the work without a license. GPL just makes the problem worse for them: anything involving any flavor of GPLv3 can end up snowballing, with major GPL rightsholders enforcing the GPLv3 cure clause, as they will most likely also be able to convince the LLM to reproduce their works.

The real TL;DR is: they have not discovered an infinite money glitch. They must play by the same rules everyone else does, and they are not warning their users of the risk of using these.

BTW, if I were wrong about this (IANAL, after all), then so are the legal departments at companies across the world. Virtually all of them won't allow AGPLv3 programs in the door just because of the legal risk, and many of them won't allow the use of LLMs given the current state of the legal landscape.

No. You don't have sufficient proof to sue me simply for using an LLM, unless I actually use it to reproduce your work. If I don't use it to actually reproduce your work, you lose. And the onus is on you to prove that I did. Your claim was:

> There is no reason why I can't sue every single developer to ever use an LLM and publish and/or distribute that code.

Simply proving that it's possible to reproduce your work with an LLM doesn't prove that I did, in fact, reproduce your work with an LLM. Just like you can't sue me for owning a VHS — even though it's possible that I could reproduce your work with one. The onus is on you to show that the person using the LLM has actually used it to violate your copyrighted work.

And running around blindly filing lawsuits claiming someone violated your copyright with no proof other than "they used an LLM to write their code!" will get your case thrown out immediately, and if you do it enough you'd likely get your lawyer disbarred (not that they'd agree to do it; there's no value in it for them, since you'll constantly lose). Just like blindly running around suing anyone who owns a VHS doesn't work. You have not discovered an infinite money glitch, or an infinite lawsuit glitch.

If you think you have, go talk to a lawyer. It's infinite free money, after all.

I think you are confused about how LLMs train and store information. These models aren't archives of code and text; they are surprisingly small, especially relative to the training dataset.

A recent Anthropic lawsuit decision also reaffirms that training on copyrighted material is not a violation of copyright.[1]

However, outputting copyrighted material would still be a violation, the same as if a person did it.

Most artists can draw a Batman symbol. Copyright means they can't monetize that ability. It doesn't mean they can't look at Batman symbols.

[1] https://www.npr.org/2025/06/25/nx-s1-5445242/federal-rules-i...