by Gormo 10 days ago
> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.

Perhaps this illustrates a fissure that was always lurking under the surface, then. The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate, and that licenses like the GPL were meant to use copyright law to achieve something that resembles the state of affairs that might exist if copyright didn't exist in the first place. That's what the whole concept of "copyleft" always seemed to imply.

Now we have a new class of technologies that is admittedly fraught with a wide range of risks and pitfalls, but also a lot of promise to enable people to actually put the "four freedoms" into practice in ways they couldn't before, and we're seeing people who have normative opinions about AI derived from other, unrelated principles trying to circle the wagons and exclude those use cases. That is what seems like a breach of the social contract as I've always understood it.

> Did they mean literally CTRL+C, CTRL+V or something broader?

Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else. "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to, and the whole point of "copyleft" was to lessen the restrictions on even that.


> "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to

This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.

A "derivative work" is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a "derivative work".

A training set is just an anthology, and the training process is condensation. That makes the weights a derivative work of every work in the training set.

Now, there's a separate discussion to be had about whether that derivative work meets the criteria for fair use, but that's its own tangent.

> This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.

A derivative work is a work that itself includes copyrighted content from the original work.

That is to say that for something to be a derivative work, some measure of its content must be "CTRL-C, CTRL-V" from the originating work.

Something that's merely inspired by another work, or draws underlying themes or factual knowledge from it, is not a derivative work.

> A training set is just an anthology,

Which might make the training set itself a derivative work, but works created by using the model trained on that anthology are a different matter.

> and the training process is condensation.

No, it isn't. It's the creation of a new work that represents patterns extrapolated or interpolated from the data set, without the resulting model actually including any of the copyrighted elements of the work.

The underlying ideas and facts in the original work were never protected by copyright. Only the specific fixed form of expression is copyrightable.

Someone who looks at a dozen code examples in public repos to learn how to do e.g. a quick sort, then upon understanding the logic flow of the quick sort algorithm, writes his own quick sort implementation is not creating a derivative work of the code in the repos he examined. And the way LLMs work is much more similar to that process than to the "compressed anthology" concept you're describing.

> A derivative work is a work that itself includes copyrighted content from the original work.

If you put a GPL C program through Emscripten to run in a browser the output doesn't include the original C code but it's surely a derivative work.

> Someone who looks at a dozen code examples in public repos to learn how to do e.g. a quick sort, then upon understanding the logic flow of the quick sort algorithm, writes his own quick sort implementation is not creating a derivative work of the code in the repos he examined. And the way LLMs work is much more similar to that process than to the "compressed anthology" concept you're describing.

This is undoubtedly the core of the disagreement. Humans can learn from what they have seen, appreciate it, understand it, and draw on that experience in what they create. They do this without being considered ripoff artists, so why not machines that simulate the "same" thing automatically?

To me the answer is simply that humans are special. Human thought and human effort make it creativity when a human does it, copying when a machine does it. It's a double standard I am perfectly willing to accept. I am unabashedly biased in this regard.

That may seem remarkably unfair to the machines, or like a cop-out. I just carved out a hardcoded special case for humans, and my whole philosophical reasoning is "because I said so". But how fair do we want to be? After all, if you want to treat a machine exactly like a human who learns from prior art to create new art, then the ownership of the new art would also belong to the machine. Not to the person who prompts it.

> If you put a GPL C program through Emscripten to run in a browser the output doesn't include the original C code but it's surely a derivative work.

Because it does include content from the original work -- this is just a translation, and isn't comparable to how LLMs work.

> To me the answer is simply that humans are special.

I don't disagree, but I also view LLMs as tools that extend human capacities and not autonomous entities unto themselves. LLMs are still just software, and can't really be regarded as anything other than instruments that humans use to broaden their capacity to see, appreciate, understand, and draw on that experience in what they create.

> That may seem remarkably unfair to the machines, or like a cop-out.

No, it's unfair to the humans. The machines are just tools that they use. The "double standard" is really a set of inconsistent standards applied to the same underlying moral agents.

> After all, if you want to treat a machine exactly like a human who learns from prior art to create new art, then the ownership of the new art would also belong to the machine. Not to the person who prompts it.

No, it always belongs to the person who prompts it. The machine is not a conscious entity, bears no intentions, and has no capacity to act on its own initiative. The machine is always just a tool that extends human capacity, as all machines always have.

For a good comparison here, we've never not credited a photographer as the author of a photograph. But the photographer is in a sense merely prompting the camera by framing the shot, selecting the exposure, adjusting the lighting, etc. -- the hard work in actually creating the photograph is being done by the camera itself, with the photographer playing no role in directly constructing the final image, and with many of the qualities of the final image being determined by pre-existing features of the camera's functional design and components that the photographer also played no role in defining, apart from choosing which camera to use.

LLMs are like cameras in this way. And the fact that they rely on external data for model training no more disclaims the user as the author of the resulting work than looking things up in a dictionary or encyclopedia does the same for the author of an essay.

The camera analogy is a good one but I have never had a camera that had every great picture somebody else had taken, plus every work of art, baked into it. They only captured what they were aimed at directly by the user. Well, maybe next time I upgrade my phone that will not be the case since they now have built-in AI "enhancement" of photos.

I agree with the framing of the AI as a tool not an autonomous entity. The thing is, to me, it is exactly that framing that makes it so the use of that tool means "copying" more than it means "learning and taking inspiration and creating new art", because who is doing the learning and being inspired? The person who types "make me a 3d arena FPS" certainly didn't do any learning from the Quake source code. The AI itself, being just a program, can't take credit.

I think of a trained AI like a lossy, highly compressed copy of its training data set. AI companies charge for access to decompress targeted pieces of that copy and the lossiness makes that decompression interesting and "new". But normally I can't charge for access to other people's stuff even if the access is highly lossy, like a camcorder bootleg.

> The camera analogy is a good one but I have never had a camera that had every great picture somebody else had taken, plus every work of art, baked into it.

I've never had an LLM that had any of that baked into it either. LLMs just have token correlations trained on those works. Trying to get an LLM to output the data it was trained on verbatim is something I'd expect to be heading into monkeys-on-typewriters territory. "Write something in the style of Shakespeare" and "give me the original text of Hamlet" are two very different things.

> I agree with the framing of the AI as a tool not an autonomous entity. The thing is, to me, it is exactly that framing that makes it so the use of that tool means "copying" more than it means "learning and taking inspiration and creating new art", because who is doing the learning and being inspired?

It's not learning or taking inspiration, though. It's just making statistical inferences based on token correlations. Whether or not that's analogous to how humans learn is something I think is a metaphysical question that is of little practical relevance. The fact remains that LLMs are not human, have no intentions of their own, do not exercise any kind of agency despite how often people employ the misnomer "agentic", and are ultimately glorified statistical models.

The LLM is a tool that extends human capacities in the same way as any other mathematical framework or technological device.

> I think of a trained AI like a lossy, highly compressed copy of its training data set.

I've seen a few people in this thread make that argument, but I just can't agree with it. It's not compression, lossy or lossless, which aims to deterministically encode a representation of the specific input data. The training data is analogous to the sample set used in a regression analysis to generate a polynomial function -- it's not valid to treat the output from any application of that polynomial as a copy of the original sample data.
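To make the regression analogy above concrete, here is a toy sketch in Python. The sample points are invented for illustration, and a straight line (a degree-1 polynomial) stands in for the fitted function. It only illustrates the analogy as stated; it is not a claim about how any real model behaves.

```python
# Hypothetical sample data, invented purely for illustration.
samples = [(0.0, 1.0), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to the given points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# The "trained model" is just these two coefficients.
a, b = fit_line(samples)


def predict(x):
    """Evaluate the fitted line at x."""
    return a * x + b


# Difference between each original sample and what the fit returns there.
residuals = [y - predict(x) for x, y in samples]

print(f"a = {a:.2f}, b = {b:.2f}")
for (x, y), r in zip(samples, residuals):
    print(f"x = {x}: sample = {y}, fitted = {predict(x):.2f}, residual = {r:+.2f}")
```

In this sketch the five samples are reduced to two coefficients, and evaluating the fit at the original x values returns nearby numbers rather than the samples themselves.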

Perhaps the future will be less Idiocracy and more Futurama, with humans and robots living socially together.

> Perhaps this illustrates a fissure that was always lurking under the surface, then(...)

Yes, I do think there has always been such a fissure. People publish OSS code for many reasons, often a blend of multiple reasons. There are selfish reasons such as the desire for one's work to be recognized, or even the hope of getting better employment through showing one's skill or making something companies will pay for support on. There are social reasons like the desire to collaborate with others. There are altruistic benefit-of-all-mankind reasons; as Richard Stallman said, "...restrictions reduce the amount and the ways that the program can be used. This reduces the amount of wealth that humanity derives from the program."

It sounds like your view of things is limited mostly to that last version of FOSS, the copyleft style. But even adherents of that style, I think, are not too happy with AI consumption of their code. For one, it allows laundering of the copyleft license so their work goes into closed-source products that are never shared. And for two, if your idea of OSS is that we all put our contributions into the great shared river of human achievements to benefit the world, it is disappointing to see that river funneled into a giant waterwheel of profit for a half dozen trillion dollar companies charging rent for its bounty.

> Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else.

I agree from a legal standpoint. I cannot enforce my personal definition of copying nor do I expect that to become possible. It was just conveniently aligned with the reality of how copying software worked in the past, and no longer is and never will be again. That doesn't mean I will be writing OSS software with a new made-up unenforceable license. It just means, like OP, I'll weigh differently whether I want to bother releasing stuff at all.

> It sounds like your view of things is limited mostly to that last version of FOSS, the copyleft style.

No, I'm well aware of the different motivations for and approaches to FOSS. I'm mostly focusing on the copyleft/GNU GPL side of the discussion here because that's the side of the house where most of the ideas of a social contract and the desire to see a specific ecosystem develop have been located. People on the MIT/BSD side of things, which has always had a much more direct "do whatever you want" ethos, are not the ones I'd expect to be making these arguments in the first place.

> For one, it allows laundering of the copyleft license so their work goes into closed-source products that are never shared.

I'd agree that someone using an LLM to create a deterministic transcription of someone else's work is indeed violating the license. But I think the argument goes beyond that, into using LLMs in any way at all.

> That doesn't mean I will be writing OSS software with a new made-up unenforceable license. It just means, like OP, I'll weigh differently whether I want to bother releasing stuff at all.

That's a reasonable position, and from the perspective of examining whether the current LLM climate is sapping motivation to participate in FOSS, I can understand where you're coming from.

But to that point, I'd argue that if your motivation was to gain recognition, participate in a community, etc. then you're going to lose those things by keeping your code private anyway, whereas you won't necessarily lose those things just because an LLM was trained on your code. If you contribute to a popular project, people were almost certainly already using your work to do things you don't approve of -- if that didn't take away your motivation, why would LLMs do much worse?

> The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate,

That's wrong. What on earth gave you that impression when the licenses specifically set constraints on what downstream can do (from "release derivatives as open" to "put me in the credits")?

Which part of which open source licenses gave you the impression that there were no restrictions?

> That's wrong. What on earth gave you that impression when the licenses specifically set constraints on what downstream can do (from "release derivatives as open" to "put me in the credits")?

These are restrictions on redistribution, not use. And they're there to make sure that derivative works can't themselves impose restrictions on use.

One correction: the point of copyleft was to exploit the restrictions in order to ensure that it would be possible for everyone to copy the software.