People who break the social contract are the ones responsible for breaking the social contract, not the ones who take steps in response to the social contract being broken.
The contract behind open source was something like (GPL):
"If you copy my work, you should share your work too."
or at minimum (MIT):
"If you copy my work, you should credit me."
I think it is no longer under dispute that the legal contract is satisfied by LLMs. The AI companies won and will continue to win.
But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way. What did the authors mean by "copy"? Did they mean literally CTRL+C, CTRL+V or something broader?
This is a matter of opinion which only each individual creator can answer. For me, copying meant something like:
"To reproduce the function of my work, dependent on my having published it, without effort nor understanding of your own"
Ten years ago this basically required doing a CTRL+C, CTRL+V so there was no need to be more specific. Anybody who did enough work to, say, rewrite in another language (with that language's idioms), met the bar of clause 3. Now AI enables a form of "copying" that matches my definition, without the user even being aware of whose works they are copying. It perfectly launders the origins of its output. It can write an FFmpeg clone in Rust for you that would appear to be a novel work.
Of course, I cannot say that my own little bits and pieces of open source code would make a scratch in AI's capability, were they removed.
But I do strongly believe that if all the code that was published by authors with the same mindset were unavailable, Claude would be a far weaker developer.
> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.
Perhaps this illustrates a fissure that was always lurking under the surface, then. The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate, and that licenses like the GPL were meant to use copyright law to achieve something that resembles the state of affairs that might exist if copyright didn't exist in the first place. That's what the whole concept of "copyleft" always seemed to imply.
Now we have a new class of technologies that is admittedly fraught with a wide range of risks and pitfalls, but also a lot of promise to enable people to actually put the "four freedoms" into practice in ways they couldn't before, and we're seeing people who have normative opinions about AI derived from other, unrelated principles trying to circle the wagons and exclude those use cases. That is what seems like a breach of the social contract as I've always understood it.
> Did they mean literally CTRL+C, CTRL+V or something broader?
Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else. "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to, and the whole point of "copyleft" was to lessen the restrictions on even that.
> "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to
This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.
A "derivative work" is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a "derivative work".
A training set is just an anthology, and the training process is condensation. That makes the weights a derivative work of every work in the training set.
Now, there's a separate discussion to be had about whether that derivative work meets the criteria for fair use, but that's its own tangent.
> This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.
A derivative work is a work that itself includes copyrighted content from the original work.
That is to say that for something to be a derivative work, some measure of its content must be "CTRL-C, CTRL-V" from the originating work.
Something that's merely inspired by another work, or draws underlying themes or factual knowledge from it, is not a derivative work.
> A training set is just an anthology,
Which might make the training set itself a derivative work, but works created by using the model trained on that anthology are a different matter.
> and the training process is condensation.
No, it isn't. It's the creation of a new work that represents patterns extrapolated or interpolated from the data set, without the resulting model actually including any of the copyrighted elements of the work.
The underlying ideas and facts in the original work were never protected by copyright. Only the specific fixed form of expression is copyrightable.
Someone who looks at a dozen code examples in public repos to learn how to do e.g. a quick sort, then, upon understanding the logic flow of the quick sort algorithm, writes his own quick sort implementation is not creating a derivative work of the code in the repos he examined. And the way LLMs work is much more similar to that process than to the "compressed anthology" concept you're describing.
> Perhaps this illustrates a fissure that was always lurking under the surface, then(...)
Yes, I do think there has always been such a fissure. People publish OSS code for many reasons, often a blend of multiple reasons. There are selfish reasons such as the desire for one's work to be recognized, or even the hope of getting better employment through showing one's skill or making something companies will pay for support on. There are social reasons like the desire to collaborate with others. There are altruistic benefit-of-all-mankind reasons, as Richard Stallman put it: "...restrictions reduce the amount and the ways that the program can be used. This reduces the amount of wealth that humanity derives from the program."
It sounds like your view of things is limited mostly to that last version of FOSS, the copyleft style. But even adherents of that style, I think, are not too happy with AI consumption of their code. For one, it allows laundering of the copyleft license so their work goes into closed-source products that are never shared. And for two, if your idea of OSS is that we all put our contributions into the great shared river of human achievements to benefit the world, it is disappointing to see that river funneled into a giant waterwheel of profit for half a dozen trillion-dollar companies charging rent for its bounty.
> Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else.
I agree from a legal standpoint. I cannot enforce my personal definition of copying nor do I expect that to become possible. It was just conveniently aligned with the reality of how copying software worked in the past, and no longer is and never will be again. That doesn't mean I will be writing OSS software with a new made-up unenforceable license. It just means, like OP, I'll weigh differently whether I want to bother releasing stuff at all.
> It sounds like your view of things is limited mostly to that last version of FOSS, the copyleft style.
No, I'm well aware of the different motivations for and approaches to FOSS. I'm mostly focusing on the copyleft/GNU GPL side of the discussion here because that's the side of the house where most of the ideas of a social contract and the desire to see a specific ecosystem develop have been located. People on the MIT/BSD side of things, which has always had a much more direct "do whatever you want" ethos, are not the ones I'd expect to be making these arguments in the first place.
> For one, it allows laundering of the copyleft license so their work goes into closed-source products that are never shared.
I'd agree that someone using an LLM to create a deterministic transcription of someone else's work is indeed violating the license. But I think the argument goes beyond that, into using LLMs in any way at all.
> That doesn't mean I will be writing OSS software with a new made-up unenforceable license. It just means, like OP, I'll weigh differently whether I want to bother releasing stuff at all.
That's a reasonable position, and from the perspective of examining whether the current LLM climate is sapping motivation to participate in FOSS, I can understand where you're coming from.
But to that point, I'd argue that if your motivation was to gain recognition, participate in a community, etc. then you're going to lose those things by keeping your code private anyway, whereas you won't necessarily lose those things just because an LLM was trained on your code. If you contribute to a popular project, people were almost certainly already using your work to do things you don't approve of -- if that didn't take away your motivation, why would LLMs do much worse?
> The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate,
That's wrong. What on earth gave you that impression when the licenses specifically set constraints on what downstream can do (from "release derivatives as open" to "put me in the credits")?
Which part of which open source licenses gave you the impression that there were no restrictions?
> That's wrong. What on earth gave you that impression when the licenses specifically set constraints on what downstream can do (from "release derivatives as open" to "put me in the credits")?
These are restrictions on redistribution, not use. And they're there to make sure that derivative works can't themselves impose restrictions on use.
> "If you copy my work, you should share your work too."
Not exactly. The GPL way is that you should share my work under the same terms if you want to share it, even if modifying it.
You are not required to share anything if you don't actually share anything, and just run it yourself. That's where all the criticism towards cloud providers who freely use FLOSS is directed.
> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.
There is clearly a misalignment in expectations from some FLOSS enthusiasts. The main FLOSS licenses focus exclusively on distribution, but their expectations somehow extend well beyond distribution. We hear those FLOSS enthusiasts criticize and attack companies for using software exactly according to their terms, and somehow that is framed as abuse if said users happen to be bigger than some arbitrary boundary.
No one consented to training LLMs. As the OP clearly implies, if they had been asked, they would have declined. So would the many copyright holders who are in the process of suing the model companies.
Are you asking how AI coding agents, the companies selling them and the individuals using them break the FOSS social contract (copyleft, attribution, upstreaming), or are you disputing that they do?
There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code. I've yet to encounter any strong arguments substantiating this premise as a general principle, and my own suspicion is that it is not valid as a general principle, given the nature of how LLMs operate.
It's certainly possible that specific instances of LLMs lazily copy-pasting code from public repos may exist, and the extent to which this is happening is something that can be substantiated by empirical examples, so if you have any to point to, I'd be interested in looking at them. However, where this is happening, it ought to be regarded as a failure mode of LLMs, and not something that implicates the underlying nature of LLMs, given that their intended purpose is to function as stochastic generators that do not merely copy-paste input data.
My initial feeling here is that using open-source code to train LLMs is not per se a violation of the generally accepted FOSS social contract, but rather that attempting to restrict specific use cases of FOSS-licensed code on the basis of normative opinions unrelated to the license terms is a violation, or at least a rejection, of that social contract. I'm not fully committed to this position, though, and would welcome well-reasoned arguments to the contrary.
Yes, but my answer would be different. It can be either about what coding agents do (and you'll see that it breaks the social contract), or it can be about what the FOSS social contract is (and you'll argue that coding agents don't break it). Lo and behold, it was the latter.
> There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code.
Not any work. But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.
There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?
The social contract of open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintenance, as well as give each other the assurance that the software will remain available to them and any future users.
AI gives the user all the benefits of using open source software with none of the obligations that come from using open source software. The developer gains nothing from going open source. It makes no sense for any developer to go open source. The social contract breaks down, and it's all because AI users didn't hold up their half of the bargain.
> But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.
I don't disagree with the premise that any LLM that is cloning code wholesale from a third-party repo is creating a derivative work, and the license terms apply to it.
But I also don't agree that non-AI code generators such as transpilers are in the same category as LLMs -- a deterministic process that is simply parsing input from a single source and outputting it in a new form is not the same thing as a stochastic process that interpolates patterns from multiple sources and then uses those patterns to generate novel outputs.
> There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?
The model isn't a lossy compression archive that merely represents a collection of pre-existing works in parallel to each other. It's a probability matrix that relates uniquely isolatable units of data to each other across the entire collection.
If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.
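To make the Markov chain example concrete, here is a minimal sketch of a first-order, word-level chain. It assumes a tiny hand-typed excerpt rather than the full play, and the names `TEXT`, `build_chain`, and `generate` are made up for illustration. It only shows the mechanics: every adjacent word pair in the output was observed in the source, while the sentence as a whole may or may not occur there. Whether that settles the derivative-work question is not something code can answer.

```python
import random
from collections import defaultdict

# A few public-domain lines from Hamlet, lowercased and stripped of punctuation.
# Assumption: a real experiment would use the full text, not this excerpt.
TEXT = (
    "to be or not to be that is the question "
    "whether tis nobler in the mind to suffer "
    "the slings and arrows of outrageous fortune "
    "or to take arms against a sea of troubles"
)


def build_chain(words):
    """Map each word to the list of words that follow it in the source."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Walk the chain, sampling each next word from the observed successors."""
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:  # dead end: this word never has a successor in the source
            break
        out.append(rng.choice(successors))
    return out


words = TEXT.split()
chain = build_chain(words)
sentence = generate(chain, "to", 8, random.Random(0))

print(" ".join(sentence))
# Reports whether the sampled sentence happens to occur verbatim in the excerpt.
print(" ".join(sentence) in TEXT)
```

With a corpus this small the walk will often reproduce source phrases verbatim; the chance of novel sequences grows with the size and variety of the input.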
> The social contract of open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintenance, as well as give each other the assurance that the software will remain available to them and any future users.
I don't think that is generally true. There's always been a hope and expectation that some subset of users would contribute back to the project in the ways you're describing, but never a sense of there being any obligation to do so. Only a fraction of FOSS users have ever contributed back to the projects whose software they use.
There's always been both a social and legal obligation to properly attribute authors and abide by license terms when redistributing or forking FOSS code, but neither obligation has ever applied when learning programming techniques from FOSS code in order to write your own software. And the way LLMs are designed to work is more similar to the latter than to the former.
But in cases where LLMs actually are acting in ways similar to the former, I agree that they should be held accountable both socially and legally.
People who take steps in response to the social contract being broken are the ones responsible for the steps they've taken, not the ones who break the social contract.
DDOSing websites seems to be an unrelated problem, and one that has traditionally been solved through response throttling and IP blocking.
Attribution is often required even on MIT or BSD licenses where code is being redistributed, either in original or modified versions, but that would relate to this discussion only to the extent that one regards using LLMs whose training data included a certain bit of code as itself constituting redistribution of that specific code -- but that in turn is a very debatable premise which really ought to be argued for, and not merely argued upon as though it is already generally recognized as true.
This is the very question under debate. Training LLMs on publicly available data is a novel situation, and neither law nor social opinion has settled on a consensus on the subject.
Copyright maximalists like to borrow unearned moral weight for their position by conflating copyright infringement with "stealing", but this is not actually true in any legal sense. It's not clear that training an AI on publicly available data should even constitute copyright infringement, much less "stealing".
Are you now layering the old and tired "copyright infringement = stealing" argument on top of the still unsubstantiated premise that all LLM training is copyright infringement?