Hacker News new | ask | show | jobs
by Gormo 10 days ago
So the questions here are (a) is any generally accepted social contract actually being broken, and (b) if so, who are the ones who are breaking it?
4 comments

The contract behind open source was something like (GPL):

"If you copy my work, you should share your work too."

or at minimum (MIT):

"If you copy my work, you should credit me."

I think it is no longer under dispute that the legal contract is satisfied by LLMs. The AI companies won and will continue to win.

But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way. What did the authors mean by "copy"? Did they mean literally CTRL+C, CTRL+V or something broader?

This is a matter of opinion which only each individual creator can answer. For me, copying meant something like:

"To reproduce the function of my work, dependent on my having published it, without effort nor understanding of your own"

Ten years ago this basically required doing a CTRL+C, CTRL+V so there was no need to be more specific. Anybody who did enough work to, say, rewrite in another language (with that language's idioms), met the bar of clause 3. Now AI enables a form of "copying" that matches my definition, without the user even being aware of whose works they are copying. It perfectly launders the origins of its output. It can write an FFmpeg clone in Rust for you that would appear to be a novel work.

Of course, I cannot say that my own little bits and pieces of open source code would make a scratch in AI's capability, were it removed.

But I do strongly believe that if all the code that was published by authors with the same mindset was unavailable, Claude would be a far weaker developer.

> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.

Perhaps this illustrates a fissure that was always lurking under the surface, then. The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate, and that licenses like the GPL were meant to use copyright law to achieve something that resembles the state of affairs that might exist if copyright didn't exist in the first place. That's what the whole concept of "copyleft" always seemed to imply.

Now we have a new class of technologies that is admittedly fraught with a wide range of risks and pitfalls, but also a lot of promise to enable people to actually put the "four freedoms" into practice in ways they couldn't before, and we're seeing people who have normative opinions about AI derived from other, unrelated principles trying to circle the wagons and exclude those use cases. That is what seems like a breach of the social contract as I've always understood it.

> Did they mean literally CTRL+C, CTRL+V or something broader?

Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else. "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to, and the whole point of "copyleft" was to lessen the restrictions on even that.

> "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to

This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.

A "derivative work" is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a "derivative work".

A training set is just an anthology, and the training process is condensation. That makes the weights a derivative work of every work in the training set.

Now, there's a separate discussion to be had about whether that derivative work meets the criteria for fair use, but that's it's own tangent.

> This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.

A derivative work is a work that itself includes copyrighted content from the original work.

That is to say that for something to be a derivative work, some measure of its content must be "CTRL-C, CTRL-V" from the originating work.

Something that's merely inspired by another work, or draws underlying themes or factual knowledge from it, is not a derivative work.

> A training set is just an anthology,

Which might make the training set itself a derivative work, but works created by using the model trained on that anthology are a different matter.

> and the training process is condensation.

No, it isn't. It's the creation of a new work that represents patterns extrapolated or interpolated from the data set, without the resulting model actually including any of the copyrighted elements of the work.

The underlying ideas and facts in the original work were never protected by copyright. Only the specific fixed form of expression is copyrightable.

Someone who looks at a dozen code examples in public repos to learn how to do e.g. a quick sort, then upon understanding the logic flow of the quick sort algorithm, writes his own quick sort implementation is not creating a derivative work of the code in the repos he exampled. And the way LLMs work is much more similar to that process than to the "compressed anthology" concept you're describing.

> A derivative work is a work that itself includes copyrighted content from the original work.

If you put a GPL C program through Emscripten to run in a browser the output doesn't include the original C code but it's surely a derivative work.

> Someone who looks at a dozen code examples in public repos to learn how to do e.g. a quick sort, then upon understanding the logic flow of the quick sort algorithm, writes his own quick sort implementation is not creating a derivative work of the code in the repos he exampled. And the way LLMs work is much more similar to that process than to the "compressed anthology" concept you're describing.

This is undoubtedly the core of the disagreement. Humans can learn from what they have seen, appreciate it, understand it, and draw on that experience in what they create. They do this without being considered ripoff artists, so why not machines that simulate the "same" thing automatically?

To me the answer is simply that humans are special. Human thought and human effort makes it creativity when a human does it, copying when a machine does it. It's a double standard I am perfectly willing to accept. I am unabashedly biased in this regard.

That may seem remarkably unfair to the machines, or like a cop-out. I just carved out a hardcoded special case for humans, and my whole philosophical reasoning is "because I said so". But how fair do we want to be? After all, if you want to treat a machine exactly like a human who learns from prior art to create new art, then the ownership of the new art would also belong to the machine. Not to the person who prompts it.

> If you put a GPL C program through Emscripten to run in a browser the output doesn't include the original C code but it's surely a derivative work.

Because it does include content from the original work -- this is just a translation, and isn't comparable to how LLMs work.

> To me the answer is simply that humans are special.

I don't disagree, but I also view LLMs as tools that extend human capacities and not autonomous entities unto themselves. LLMs are still just software, and can't really be regarded as anything other than instruments that humans use to broaden their capacity to see, appreciate, understand, and draw on that experience in what they create.

> That may seem remarkably unfair to the machines, or like a cop-out.

No, it's unfair to the humans. The machines are just tools that they use. The "double standard" is really a set of inconsistent standards applied to the same underlying moral agents.

> After all, if you want to treat a machine exactly like a human who learns from prior art to create new art, then the ownership of the new art would also belong to the machine. Not to the person who prompts it.

No, it always belongs to the person who prompts it. The machine is not a conscious entity, bears no intentions, and has no capacity to act on its own initiative. The machine is always just a tool that extends human capacity, as all machines always have.

For a good comparison here, we've never not credited a photographer as the author of a photograph. But the photographer is in a sense merely prompting the camera by framing the shot, selecting the exposure, adjusting the lighting, etc. -- the hard work in actually creating the photograph is being done by the camera itself, with the photographer playing no role in directly constructing the final image, and with the many of the qualities of the final image being determined by pre-existing features of the camera's functional design and components that the photographer also played no role in defining, apart from choosing which camera to use.

LLMs are like cameras in this way. And the fact that they rely on external data for model training no more disclaims the user as the author of the resulting work than looking things up in a dictionary or encyclopedia does the same for the author of an essay.

Perhaps the future will be less Idiocracy and more Futurama, with humans and robots living socially together.
> Perhaps this illustrates a fissure that was always lurking under the surface, then(...)

Yes, I do think there has always been such a fissure. People publish OSS code for many reasons, often a blend of multiple reasons. There are selfish reasons such as the desire for one's work to be recognized, or even the hope of getting better employment through showing ones' skill or making something companies will pay for support on. There are social reasons like the desire to collaborate with others. There are altruistic benefit-of-all-mankind reasons like Richard Stallman said "...restrictions reduce the amount and the ways that the program can be used. This reduces the amount of wealth that humanity derives from the program."

It sounds like your view of things is limited mostly to that last version of FOSS, the copyleft style. But even adherents of that style, I think, are not too happy with AI consumption of their code. For one, it allows laundering of the copyleft license so their work goes into closed-source products that are never shared. And for two, if your idea of OSS is that we all put our contributions into the great shared river of human achievements to benefit the world, it is disappointing to see that river funneled into a giant waterwheel of profit for a half dozen trillion dollar companies charging rent for its bounty.

> Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else.

I agree from a legal standpoint. I cannot enforce my personal definition of copying nor do I expect that to become possible. It was just conveniently aligned with the reality of how copying software worked in the past, and no longer is and never will be again. That doesn't mean I will be writing OSS software with a new made-up unenforceable license. It just means, like OP, I'll weigh differently whether I want to bother releasing stuff at all.

> It sounds like your view of things is limited mostly to that last version of FOSS, the copyleft style.

No, I'm well aware of the different motivations for and approaches to FOSS. I'm mostly focusing on the copyleft/GNU GPL side of the discussion here because that's the side of the house where most of ideas of a social contract and desire to see a specific ecosystem develop have been located. People on the MIT/BSD side of things, which has always had a much more direct "do whatever you want" ethos, are not the ones I'd expect to be making these arguments in the first place.

> For one, it allows laundering of the copyleft license so their work goes into closed-source products that are never shared.

I'd agree that someone using an LLM to create a deterministic transcription of someone else's work is indeed violating the license. But I think the argument goes beyond that, into using LLMs in any way at all.

> That doesn't mean I will be writing OSS software with a new made-up unenforceable license. It just means, like OP, I'll weigh differently whether I want to bother releasing stuff at all.

That's a reasonable position, and from the perspective of examining whether the current LLM climate is sapping motivation to participate in FOSS, I can understand where you're coming from.

But to that point, I'd argue that if your motivation was to gain recognition, participate in a community, etc. then you're going to lose those things by keeping your code private anyway, whereas you won't necessarily lose those things just because an LLM was trained on your code. If you contribute to a popular project, people were almost certainly already using your work to do things you don't approve of -- if that didn't take away your motivation, why would LLMs do much worse?

> The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate,

That's wrong. What on earth gave you that impression when the licenses specifically set constraints on what downstream can do (from "release derivatives as open" to "put me in the credits").

Which part of which open source licenses gave you the impression that there were no restrictions?

> That's wrong. What on earth gave you that impression when the licenses specifically set constraints on what downstream can do (from "release derivatives as open" to "put me in the credits").

These are restrictions on redistribution, not use. And they're there to make sure that derivative works can't themselves impose restrictions on use.

One correction: the point of copyleft was to explot the restrictions in order to ensure that it would be possible for everyone to copy the software.
> "If you copy my work, you should share your work too."

Not exactly. The GPL way is that you should share my work under the same terms if you want to share it, even if modifying it.

You are not required to share anything if you don't actually share anything, and just run it yourself. That's where all the criticism towards cloud providers who freely use FLOSS is directed.

> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.

There is clearly a misalignment in expectations from some FLOSS enthusiasts. The main FLOSS licenses focus exclusively on distribution, but their expectations somehow extend well beyond distribution. We hear those FLOSS enthusiasts criticize and attack companies for using software exactly according to their terms, and somehow that is framed as abuse if said users happen to be bigger than some arbitrary boundary.

No one consented to training llms, as the op clearly implies, if they had been asked they would have declined to do so. As would all of the many copyright holders who are in the process of suing the model companies.
Are you asking how AI coding agents, the companies selling them and the individuals using them break the FOSS social contract (copyleft, attribution, upstreaming), or are you disputing that they do?
Both would resolve to the same question, no?

There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code. I've yet to encounter any strong arguments substantiating this premise as a general principle, and my own suspicion is that it is not valid as a general principle, given the nature of how LLMs operate.

It's certainly possible that specific instances of LLMs lazily copy-pasting code from public repos may exist, and the extent to which this is happening is something that can be substantiated by empirical examples, so if you have any to point to, I'd be interested in looking at them. However, where this is happening, it ought to be regarded as a failure modality of LLMs, and not something that implicates the underlying nature of LLMs, given that their intended purpose is to function as stochastic generators that do not merely copy-paste input data.

My initial feeling here is that using open-source code to train LLMs is not per se a violation of the generally accepted FOSS social contract, but rather that attempting to restrict specific use cases of FOSS-licensed code on the basis of normative opinions unrelated to the license terms is a violation, or at least a rejection, of that social contract. I'm not fully committed to this position, though, and would welcome well-reasoned arguments to the contrary.

> Both would resolve to the same question, no?

Yes but my answer would be different. It can be either about what coding agents do (and you'll see that it breaks the social contract), or it can be about what the FOSS social contract is (and you'll argue that coding agents don't break it.) Lo and behold, it was the latter.

> There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code.

Not any work. But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.

There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?

The social contract of the open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintance, as well as give each other the assurance that the software will remain available to them and any future users.

AI gives the user all the benefits of using open source software with none of the obligations that come from using open source software. Developer gains nothing from going open source. It makes no sense for any developer to go open source. Social comtract breaks down, and it's all because AI users didn't hold up their half of the bargain.

> But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.

I don't disagree with the premise that any LLM that is cloning code wholesale from a third-party repo is creating a derivative work, and the license terms apply to it.

But I also don't agree that non-AI code generators such as transpilers are in the same category as LLMs -- a deterministic process that is simply parsing input from a single source and outputting it in a new form is not the same thing as a stochastic process that interpolates patterns from multiple sources and then uses those patterns to generate novel outputs.

> There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?

The model isn't a lossy compression archive that merely represents a collection of pre-existing works in parallel to each other. It's a probability matrix that relates together uniquely isolatable units of data to each other across the entire collection.

If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

> The social contract of the open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintance, as well as give each other the assurance that the software will remain available to them and any future users.

I don't think that is generally true. There's always been a hope and expectation that some subset of users would contribute back to the project in the ways you're describing, but never a sense of there being any obligation to do so. Only a fraction of FOSS users have ever contributed to back to the projects whose software they use.

There's always been both a social and legal obligation to properly attribute authors and abide by license terms when redistributing or forking FOSS code, but neither obligation has ever applied when learning programming techniques from FOSS code in order to write your own software. And the way LLMs are designed to work is more similar to the latter than to the former.

But in cases where LLMs actually are acting in ways similar to the former, I agree that they should be held accountable both socially and legally.

>If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

If you write "To see or not to see, that is the question" about a person named Eyelet, who is going blind, how can you argue that it is NOT derivative of / borrowed from Hamlet? Yet that sentence is not in the work. Isn't that what LLMs essentially do? Tokenize, then substitute in new values for certain tokens, while retaining the general structure?

> a deterministic process that is simply parsing input from a single source and outputting it in a new form is not the same thing as a stochastic process that interpolates patterns from multiple sources and then uses those patterns to generate novel outputs.

There are stochastic compression algorithms (e.g. https://github.com/kaydotdev/sqic) and it would be insane to claim they don't produce derivative works. And as a general rule, a work based on multiple other works is derivative of all af them.

> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

No, but your generated text is also useless if you want to read Hamlet. The danger I'm speaking of is people generating Hamlets but paraphrased - that's a derivative, especially if you use an automated tool that got original Hamlet as its input. Except the Hamlet in question is the Linux kernel but not bound by GPL. Also, your Markov chain itself is a derivative work.

> I don't think that is generally true. There's always been a hope and expectation that some subset of users would contribute back to the project in the ways you're describing, but never a sense of there being any obligation to do so. Only a fraction of FOSS users have ever contributed to back to the projects whose software they use.

True, but that fraction of a huge number is still big enough to be meaningful help. Plus the recognition. Most users respect the attribution clause. AI legally-distinct clones drop the fraction of helpers and the number of attributions straight down to 0. That changes the equation, what previously made sense now straight up doesn't.

> But in cases where LLMs actually are acting in ways similar to the former, I agree that they should be held accountable both socially and legally.

And because OpenAI et al. hold all the money and all the lawyers, the only way to hold them accountable is to stop publishing open source altogether. That's the only leverage OSS community has.

> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

Uh, that is exactly what a derivative work is. You literally specify that Hamlet is an input to your work. I believe you're conflating derivative with transformative. You're certainly creating a transformative derivation of Hamlet, but you are by definition creating a derivative work by training a Markov chain on the text of Hamlet.

The obvious follow up here is whether an LLM is creating transformative derivations or not. A lot of folks argue that yes, an LLM spitting out statistically sampled code that matches existing code is not transformative and is (or might be) infringing the terms of the license it was released under. Others argue that there's not an exact copy of the original source in the LLM's weights so by definition it must be a transformative work. I think it's a pretty obvious "somewhere in the middle" that is gonna make a bunch of lawyers a whole lot of money.

Personally, I don't care one way or the other. I'm one of the folks that thinks software shouldn't be copyright-able in the first place.

> Uh, that is exactly what a derivative work is.

No, it isn't. A derivative work isn't something based on extracting underlying ideas or patterns from another work, it's something that includes copyrighted portions of the other work.

An annotated edition of Hamlet is a derivative work. A Cliff's Notes summary of Hamlet is a derivative work.

Strange Brew and The Lion King are not derivative works of Hamlet simply because they include literary themes and plot points that originated in Hamlet. A list of word counts of popular works of literature that includes an entry for Hamlet is also not a derivative work. The Markov chain described above is not a derivative work.

> The obvious follow up here is whether an LLM is creating transformative derivations or not. A lot of folks argue that yes, an LLM spitting out statistically sampled code that matches existing code is not transformative and is (or might be) infringing the terms of the license it was released under.

And I would agree with them. An LLM that actually is outputting non-trivial code that matches a public project's code verbatim is engaging in copying, and not stochastic inference.

> I think it's a pretty obvious "somewhere in the middle" that is gonna make a bunch of lawyers a whole lot of money.

It's a shame that the same fundamental questions have to be relitigated over and over again just because the contextual formalities and modes of expression have changed. I wonder how many of the legal cases are going to be copies or derivative works of previous ones.

Yes, and obviously: bots crushing servers in strict contravention of the robots.txt rules.