by Xirdus 10 days ago
> Both would resolve to the same question, no?

Yes but my answer would be different. It can be either about what coding agents do (and you'll see that it breaks the social contract), or it can be about what the FOSS social contract is (and you'll argue that coding agents don't break it.) Lo and behold, it was the latter.

> There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code.

Not any work. But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.

There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?

The social contract of open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintenance, as well as give each other the assurance that the software will remain available to them and any future users.

AI gives the user all the benefits of using open source software with none of the obligations that come from using open source software. The developer gains nothing from going open source. It makes no sense for any developer to go open source. The social contract breaks down, and it's all because AI users didn't hold up their half of the bargain.

1 comments

> But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.

I don't disagree with the premise that any LLM that is cloning code wholesale from a third-party repo is creating a derivative work, and the license terms apply to it.

But I also don't agree that non-AI code generators such as transpilers are in the same category as LLMs -- a deterministic process that is simply parsing input from a single source and outputting it in a new form is not the same thing as a stochastic process that interpolates patterns from multiple sources and then uses those patterns to generate novel outputs.

> There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?

The model isn't a lossy compression archive that merely represents a collection of pre-existing works in parallel to each other. It's a probability matrix that relates together uniquely isolatable units of data to each other across the entire collection.

If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.
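For readers who haven't built one, here is a minimal sketch of the kind of word-level Markov chain described above. It is an illustration under assumptions, not anyone's actual method: the short excerpt stands in for the full text of Hamlet, and the names `build_chain` and `generate` are hypothetical.

```python
import random
from collections import defaultdict

# A short public-domain excerpt stands in for the full text of Hamlet.
TEXT = (
    "to be or not to be that is the question "
    "whether tis nobler in the mind to suffer "
    "the slings and arrows of outrageous fortune "
    "or to take arms against a sea of troubles"
)

def build_chain(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

def generate(chain, start, length, rng):
    """Walk the chain from `start`, sampling each next word at random."""
    out = [start]
    # Stop early if the current word never had a successor in the source.
    while len(out) < length and out[-1] in chain:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

chain = build_chain(TEXT)
sentence = generate(chain, "to", 8, random.Random(0))
print(sentence)
```

Every adjacent word pair in the output occurs somewhere in the source text, even when the sentence as a whole does not; whether that makes the output or the table itself "derivative" is the point under dispute in this thread.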

> The social contract of open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintenance, as well as give each other the assurance that the software will remain available to them and any future users.

I don't think that is generally true. There's always been a hope and expectation that some subset of users would contribute back to the project in the ways you're describing, but never a sense of there being any obligation to do so. Only a fraction of FOSS users have ever contributed back to the projects whose software they use.

There's always been both a social and legal obligation to properly attribute authors and abide by license terms when redistributing or forking FOSS code, but neither obligation has ever applied when learning programming techniques from FOSS code in order to write your own software. And the way LLMs are designed to work is more similar to the latter than to the former.

But in cases where LLMs actually are acting in ways similar to the former, I agree that they should be held accountable both socially and legally.

> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

If you write "To see or not to see, that is the question" about a person named Eyelet, who is going blind, how can you argue that it is NOT derivative of / borrowed from Hamlet? Yet that sentence is not in the work. Isn't that what LLMs essentially do? Tokenize, then substitute in new values for certain tokens, while retaining the general structure?

> a deterministic process that is simply parsing input from a single source and outputting it in a new form is not the same thing as a stochastic process that interpolates patterns from multiple sources and then uses those patterns to generate novel outputs.

There are stochastic compression algorithms (e.g. https://github.com/kaydotdev/sqic) and it would be insane to claim they don't produce derivative works. And as a general rule, a work based on multiple other works is derivative of all of them.

> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

No, but your generated text is also useless if you want to read Hamlet. The danger I'm speaking of is people generating Hamlets but paraphrased - that's a derivative, especially if you use an automated tool that got original Hamlet as its input. Except the Hamlet in question is the Linux kernel but not bound by GPL. Also, your Markov chain itself is a derivative work.

> I don't think that is generally true. There's always been a hope and expectation that some subset of users would contribute back to the project in the ways you're describing, but never a sense of there being any obligation to do so. Only a fraction of FOSS users have ever contributed back to the projects whose software they use.

True, but that fraction of a huge number is still big enough to be meaningful help. Plus the recognition. Most users respect the attribution clause. AI legally-distinct clones drop the fraction of helpers and the number of attributions straight down to 0. That changes the equation: what previously made sense now straight up doesn't.

> But in cases where LLMs actually are acting in ways similar to the former, I agree that they should be held accountable both socially and legally.

And because OpenAI et al. hold all the money and all the lawyers, the only way to hold them accountable is to stop publishing open source altogether. That's the only leverage the OSS community has.

> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

Uh, that is exactly what a derivative work is. You literally specify that Hamlet is an input to your work. I believe you're conflating derivative with transformative. You're certainly creating a transformative derivation of Hamlet, but you are by definition creating a derivative work by training a Markov chain on the text of Hamlet.

The obvious follow up here is whether an LLM is creating transformative derivations or not. A lot of folks argue that yes, an LLM spitting out statistically sampled code that matches existing code is not transformative and is (or might be) infringing the terms of the license it was released under. Others argue that there's not an exact copy of the original source in the LLM's weights so by definition it must be a transformative work. I think it's a pretty obvious "somewhere in the middle" that is gonna make a bunch of lawyers a whole lot of money.

Personally, I don't care one way or the other. I'm one of the folks that thinks software shouldn't be copyright-able in the first place.

> Uh, that is exactly what a derivative work is.

No, it isn't. A derivative work isn't something based on extracting underlying ideas or patterns from another work, it's something that includes copyrighted portions of the other work.

An annotated edition of Hamlet is a derivative work. A Cliff's Notes summary of Hamlet is a derivative work.

Strange Brew and The Lion King are not derivative works of Hamlet simply because they include literary themes and plot points that originated in Hamlet. A list of word counts of popular works of literature that includes an entry for Hamlet is also not a derivative work. The Markov chain described above is not a derivative work.

> The obvious follow up here is whether an LLM is creating transformative derivations or not. A lot of folks argue that yes, an LLM spitting out statistically sampled code that matches existing code is not transformative and is (or might be) infringing the terms of the license it was released under.

And I would agree with them. An LLM that actually is outputting non-trivial code that matches a public project's code verbatim is engaging in copying, and not stochastic inference.

> I think it's a pretty obvious "somewhere in the middle" that is gonna make a bunch of lawyers a whole lot of money.

It's a shame that the same fundamental questions have to be relitigated over and over again just because the contextual formalities and modes of expression have changed. I wonder how many of the legal cases are going to be copies or derivative works of previous ones.

> Strange Brew and The Lion King are not derivative works of Hamlet simply because they include literary themes and plot points that originated in Hamlet.

But try to write your own story of a lion cub chased away by his uncle and living in a jungle until his childhood friend finds him and convinces him to reclaim his kingdom, and you'll quickly hear from Disney's lawyers how non-derivative it really is.

OSS devs aren't worried about Hamlet reinterpretations. They're worried about legally-distinct-but-functionally-identical software clones. Unlike Disney, they don't have millions in their pockets to fight the legal battle. You know who does have millions? The people they'd be fighting against, who are going to use every single one of your arguments to claim their AI-generated reimplementation of Kefir is not bound by GPL (or even by BSD 3-clause in case of runtime). No share-alike, no attribution, no nothing. If they are right, then the OSS social contract is dead. Even if they're not right, but behave as if they're right because they have lawyers and OSS devs don't - the social contract is just as dead.

> But try to write your own story of a lion cub chased away by his uncle and living in a jungle until his childhood friend finds him and convinces him to reclaim his kingdom, and you'll quickly hear from Disney's lawyers how non-derivative it really is.

I'd expect them to say "we don't like this, but since it's not actually a derivative work, we can't do anything about it". As long as you're not directly copying things like characters, dialogue, etc., it's not a derivative work.

That's why Armageddon is not a derivative work of Deep Impact, the Shark Attack series is not a derivative work of Jaws, the more famous Titanic is not a derivative work of 1979's S.O.S. Titanic, and the Harry Potter series is not a derivative work of Teen Witch.

Using the same story themes, plot points, and setting as another work does not implicate that other work's copyright. Only substantial copying of specifics does.

> As long as you're not directly copying things like characters, dialogue, etc., it's not a derivative work.

Define a character. Is another lion prince named Simba the same character? Is a lion prince named something else the same character? Is a human prince named Simba the same character? I'm no copyright expert, but from what I know about fanfics and fanart, the US courts ruled all of these violate copyright (you can win a book plagiarism lawsuit even if the other book has all names changed and every sentence went through a thesaurus). The few cases where the obvious stand-in was ruled non-infringing were on the grounds of the parody exception, not on the grounds of being non-derivative.

The many Titanic movies are not each other's derivatives because none of them are based on each other. They're all based on the historical events directly. Now, if the original Titanic was fictional like the famous Nautilus, then yes, the 1997 movie would be derivative, but not of the 1979 series.

Which part of Harry Potter directly rips off Teen Witch the way Lion King directly rips off Hamlet? I'm not familiar with that movie.