by reorder9695 49 days ago
The whole thing with GPL code seems like a mess, and surely it couldn't be set as actual precedent, right? It is totally infeasible for me to check every single GPL project on every code hosting platform to see if the code Claude etc produced is too similar. If the set of training data used for the model were released to check against, that would be one thing, but you can't honestly expect someone to check every repo available from all time to see if a model (whose training data you are not told about, so you can't know what it could reproduce) might've reproduced code from it.

That's not at all like checking the dependency chain of a dependency, where you can just read the licence of anything you're choosing to use. Surely the precedent would have to be that a model trained on GPL code has itself been infected by GPL, and therefore must have all source/weights released too, if the assumption here is that it can embed the code well enough to reproduce it?

5 comments

> Surely the precedent would have to be that a model trained on GPL code has itself been infected by GPL, and therefore must have all source/weights released

I don't see how this follows, unless we also agree that humans who have ever read any GPL code are themselves permanently tainted and therefore cannot produce anything that isn't influenced even slightly by said code.

Is it just because we think the robot does a better job at learning than we do? It's an impossible line to draw, I agree, but I don't agree that the answer is "well then everything must be considered tainted." I say the answer is "ignore a vestigial concern of a bygone era."

The robot does a better job at reproduction. I don't think there exists a definition of "learning" unambiguous enough to make the claim that it learns better than humans. Specifically, published models don't learn at all -- after the training phase, the model weights are fully static.

Duplicating BSD-licensed code without copyright attribution and mention of the original license is just as much a violation of the original copyright -- that applies regardless of additional copyleft requirements imposed by the GPL. A different but no less serious restriction applies to all the code examples on MSDN: the license disallows using the samples in production code.

LLMs are effectively copyright laundering machines, and barring any indemnification clauses in the ToS (of course there are none), full liability lies with the user.

There's an easy solution... release your code as GPL :)

(but that doesn't protect you against GPL-incompatible copyleft licenses, I guess)

> It is totally infeasible for me to check every single GPL project on every code hosting platform to see if the code Claude etc produced is too similar.

I would say that choosing a tool that makes it infeasible doesn't actually excuse you from doing it.

> but you can't honestly expect someone to check every repo available from all time to see if a model [...] might've reproduced code from it.

Well, if you care about not violating any licenses, you could buy services from an LLM provider whose model was trained only on code in the public domain (or code that the provider licensed for that purpose), and/or buy some kind of legal guarantee from the provider that the code produced is "clean".

Of course, that'd be much more expensive than current offerings, but it would reflect the real cost of software development, not just YOLOing it, from a legal perspective.

When I wrote a book, part of the contract with my publisher was that I had to attest that I actually wrote the book myself, that quotes were properly attributed, etc. If you buy code-writing services, why shouldn't the contract contain similar clauses?