| HN Mirror



	by jbotz 5 days ago
	A translation of a book to a different language is a derivative work. So a translation of a computer program to a different programming language is one too. But if in the translation of the book you start altering the plot and the personalities of the characters, does it at some point become not a derivative work? At what point? IANAL, and I have no real idea, but I imagine that point has been probed significantly in case law with respect to creative works. Given the current climate of ever-expanding scope of "intellectual property", if they admit that the LLM had access to git source code then I would say their case is weak at best.

4 comments

WD-42 5 days ago

The agents.md says “here’s the git source code” https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...

This isn’t even a question of training data, they fed the full git source code directly to the LLM.


To1ne 4 days ago

I would say it's worse, the whole C Git source code is checked in https://github.com/gitbutlerapp/grit/tree/main/git


throw-the-towel 5 days ago

I wonder if imitating clean room reverse engineering with two LLMs would be enough for licence compliance.


cyphar 4 days ago

That already exists[1]. It looks like a joke but apparently they will accept your money to do it, which seems to cross the line of a joke.

[1]: https://malus.sh/


anilgulecha 5 days ago

> translation.

It's not technically a translation, it's a re-implementation, with test suites acting as the destination. If it were a file-by-file translation your argument would have been valid.


20k 5 days ago

Git is part of the LLM's training set though, so simply asking it to recreate git in another language is pretty equivalent. Like, you can almost certainly get these LLMs to output git's full source code with some prompting, so there's not that much difference (as much as we like to pretend that AI-generated code has no copyright implications).


yusefnapora 5 days ago

As mentioned in another comment, it's even more clear cut in this case. They actually put the original git sources in their project repo and instructed the agent to use it as the "source of truth".

Simple thought experiment. If you handed this same agents.md file (https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...) to a human software developer and let them work on exactly the same goal, would their output be considered a derivative work?


spacechild1 5 days ago

That's something I have been wondering. If I as a human want to make a clean room reimplementation of some API or application, I must not have read the source code of the original implementation. I don't see why this shouldn't apply to LLMs as well. If an LLM might have been trained on the original source code, it should be considered "tainted".


20k 5 days ago

Yes, and realistically any code that LLMs produce is a derivative work of its training data. There's going to be a huge disaster licensing-wise.

I have absolutely no idea how LLMs got through anyone's legal departments. I guess the hope is that if everyone breaks the law enough, it'll just be fine.


xorcist 3 days ago

> the hope is that if everyone breaks the law enough, it'll just be fine

Ever since the early 2010s when companies were started with the business idea "unlicensed hotels" and "unlicensed taxis" and made the owners really, really rich, this is said pretty much out loud. Look for words like "regulatory risks" and similar.

Maybe it started with the unlicensed gambling fad before that? That also made a lot of people filthy rich. Every time you have something under special license, or insurance requirements, then of course there is a margin for you if you can skimp on the license and hire gig workers instead.

The LLM situation with copyright and derived works in the 2020s is similar. Someone is likely to be rich, but there is a clear regulatory risk to it.


thewebguyd 5 days ago

> if everyone breaks the law enough, it'll just be fine

That's pretty much what happened, isn't it? These concerns were all discussed in the beginning back in 2022, and I recall answers from many here on HN along the lines of "oh well, we can't stop it now or we'll risk falling behind China in AI development"

So yeah, the laws went out the window a long time ago the moment our government and the people decided to just look the other way willingly in the name of "progress."


bcjdjsndon 5 days ago

Problem is there's a lot more than a single repo in the training data; the corpus is massive... Should the author of a blog post on cats also be compensated for simply being in the same training data as the git repo?


20k 4 days ago

Honestly? Yes. This is why it's such a problem that most of the training data was used without permission, and without the correct copyright status or license associated with it.

There are a lot of arguments about humans doing the same thing, but the reality is that humans and robots don't enjoy the same legal protection. It's clearly a derivative work of all of its training data.


Pet_Ant 5 days ago

> If I as a human want to make a clean room reimplementation of some API or application, I must not have read the source code of the original implementation.

That is the difference between necessary and sufficient. Clean-room is sufficient to guarantee avoiding copyright infringement, but it is not necessary. The legal line is south of there, but that position was chosen because they didn’t want to risk crossing it and it was easier to argue in court.

tl;dr: clean room is overkill for avoiding copyright infringement


rcxdude 5 days ago

> Like, you can almost certainly get these LLMs to output gits full source code with some prompting, so there's not that much difference (as much as we like to pretend that AI generated code has no copyright implications)

Are you sure? LLMs are in some way a compressed version of their input but it's a pretty lossy compression (arguably this makes them more like a compression algorithm than a compressed version of the data). I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.


philipportner 5 days ago

> I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.

Granted, these are some of the most widely spread texts, and not codebases, but just fyi: https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).


rcxdude 5 days ago

That paper is basically using the LLM as a compression algorithm: it's prompting with some section of the book and it's reprompting if it doesn't give the right output. Notably this only works if you already have a copy of the book in question!


20k 4 days ago

Distributing a compressed copy of something is still copyright infringement.


alienbaby 5 days ago

Wouldn't a re-implementation be akin to 'here's how it works, write the code' rather than 'here's the code, redo it in Rust'?


spwa4 5 days ago

Yes, but as soon as copyright became a problem for very rich people, parts of it were cancelled:

1) re-implementation for compatibility (which was quickly "reestablished" through use of copyright-protecting encryption. In other words: do you get to write software that connects to MS/Apple/Google/Facebook servers without authorization from those companies? Yes. Do you get to copy an encryption key from their software to make it possible? No)

and, more recently,

2) violating copyright for LLM training

and, currently mostly attempted:

3) "uncopyrighting": run software through an LLM, and some people "believe" it comes out with your copyright on it! Because very rich people want to sell uncopyrighting.

I.e. the jury's still out on what will happen when it's billionaire vs billionaire.

Of course, the question is what happens the second someone does this with a Disney movie, or a big Microsoft application ...


bcjdjsndon 5 days ago

> Yes, but as soon as copyright became a problem for very rich people parts of it were cancelled.

When copyright law was established, not many poor people owned printing presses. That is to say, copyright law is a PROTECTION for the very rich, not an inconvenience.


spwa4 5 days ago

True, but as the exception for model training (which can only be done by very, very rich people and organizations) shows, there are some new rich and they want new rules.

Against the will of the people, as evidenced by the court cases and protests online ...


miohtama 5 days ago

Related: software API compatibility is not a derivative work, or eligible for protection, as ruled in the US and in the EU. Google, SAP R/3, etc. cases.

Or SCO Vs IBM.

If everything were a derivative work, we would not have Linux.
