Hacker News new | ask | show | jobs
by sph 1259 days ago
Good idea, and we need the same thing for code repositories.

I want a derivative version of the GPL that says "this code or its output cannot be used as training data for machine learning, or any other artificial intelligence, unless explicitly authorized in writing by the copyright owners."

6 comments

I would be fine with my GPL code being used as training data. As long as attribution and copyleft is kept for any derivative work created by AI.

Which might as well be anything the AI outputs. Yes, please include my copyleft work, actually :-)

(I'm afraid having different versions of the GPL would split the free software / open source community, I'd prefer copyright laws to be updated / jurisprudence to decide that it's not ok to ignore licences)

> As long as attribution and copyleft is kept for any derivative work created by AI.

That might get your name on a list between a thousand others. And people who want to be on that list can get there easily by starting a (fake) GitHub repo.

In other words, being on the acknowledgement list stops being meaningful.

I was a bit tongue in cheek on this part of my comment, because that's the idea.

You are essentially saying that respecting the license I chose is impractical.

In other words, that my work cannot be used in such a setting.

So, either AIs need to stop being trained so liberally, or it becomes clear that copyright does not apply when used as training set. I appreciate that other people have different opinion on this, but I pretty much don't want my work to be used like this, and having a variant of licenses to cover this would be concerning.

The fight about this is not over, hence the hope I'm expressing.

It's not the acknowledgement list he's talking about, but the code that the AI spits out being GPL'd as well.
But it's also the acknowledgement list, and this part is not specific to the GPL, it also applies to BSD, MIT and all the other licenses which include an attribution clause, which is most of them.

If we require AIs to respect licenses of work in their training set, it's a mess, really. It might actually be unsolvable because licenses are not all compatible with each others.

I don’t really see why that’s a mess. There are two easy solutions I can think of right off the bat:

- Don’t train AI on work with incompatible licenses

- Don’t train AI on copyrighted work at all

IANAL, but unless explicitly disallowed, use as training data is allowed.

Given that no version of the GPL (or other OSI license) has provisions about that type of usage, they are all AI-friendly.

This is the whole question.

> Given that no version of the GPL (or other OSI license) has provisions about that type of usage, they are all AI-friendly.

With copyright laws, not explicitly granted means forbidden, unless covered/overridden by law (like fair use for instance). So this could be an argument against AI use without attribution / copyleft, since it's not explicitly granted.

Now, it remains to be seen if law / jurisprudence deems AI use exempt from copyright restrictions (or something like fair use applies).

Would the same attribution be required for code written by a human that learned from your example but did not copy it verbatim?

Do you attribute all of the code you ever learned from?

It may be because I’m a crappy programmer, but I often have SO or a search-discovered web page with someone else’s code while writing or debugging my own. Not for copy/paste, but to see how someone else used a function or solved a difficult problem that I’m stuck on.

I’ve never considered that to be a license violation. Is it?

No, unless your code looks sufficiently like the code you got inspiration from and this code is sufficiently non trivial.

For code coming from Stack Overflow, the license to follow would be CC BY-CA 4.0.

(which is kinda like GPL, and which is not the best license for code by the way)

https://stackoverflow.com/legal/terms-of-service#licensing

The license has no bearing on who or what may read code.

Licenses may control the circumstances of code facsimiles and products, but trying to stop your code from simply being seen or read with a disclaimer is impossible and silly.

I just want derivative work to respect the conditions I chose, that the copyright laws allow me to impose.

And I think training an AI is not like learning or reading for a human being.

And that's not silly. It's an opinion, but one shared with many people.

You know one of the core arguments is there. You tried to hide it and you didn't blink before using words like "simply" and "silly". That's disrespectful.

Simplicity is important and when something is silly it’s important to say so. If you feel disrespected I think you’re being too sensitive.

I think you’re confused about copyright law and about what is a facsimile vs a derivative work, and you’re also wrong about learning being fundamentally different for AI. AI learning is vastly superior and much more capable of both facsimile and imitation… but what is that if not learning; that is the definition of learning, plainly.

> I want a derivative version of the GPL that says "this code or its output cannot be used as training data for machine learning, or any other artificial intelligence, unless explicitly authorized in writing by the copyright owners."

I'm not sure how such a license would work, but I've written a license that does the next best thing: defines things in such a way that it's clear that the output of a machine learning algorithm is also subject to the license.

The license is at [1], and I have a whole set at [2].

Disclaimer: Do NOT use these licenses yet. They haven't been checked by a lawyer. I'm trying to build the funds to do so.

[1]: https://yzena.com/yzena-viral-license/

[2]: https://yzena.com/licenses/

> defines things in such a way that it's clear that the output of a machine learning algorithm is also subject to the license

Can you do this within copyright laws? (i.e. is it enforceable?) How are you sure?

For instance I could write a license that says that any code you wrote with a screen you've also used for displaying my code should be put under this license, but I'm pretty sure it would not be enforceable.

(another question: how is your license both permissive and viral?)

Good questions!

> Can you do this within copyright laws? (i.e. is it enforceable?) How are you sure?

I can do it under copyright because I was very careful to only do what has been tested with the GPL in court.

When the GPL was tested in court, at last once or twice the infringers tried to argue that the GPL didn't apply to machine code because machine code was substantially different from the copyrighted form (source code).

Courts have rejected this, saying that a different form is not substantially different.

My licenses basically encode that specifically, by saying that running any algorithm on the software that also produces software does not make the output of the algorithm substantially different from the input, even if the input is only partially the software under the license.

> (another question: how is your license both permissive and viral?)

Copy and paste error. (Since the licenses have many of the same things, I copied and pasted a lot.) I'll fix it next time I'm at my desktop machine.

It'll be interesting to see the legal battles around this considering, for humans, a clean room approach is often used to reimplementing an API, as it becomes an easy case for infringement (even if accidental) if you've seen the original implementation. If you ask ChatGPT to reimplement an API from a project it's indexed, seems it will likely infringe to some extent every time
Clean room design is there to ensure that the parts of the code that are expressive are not copied… the class structures and overall “shape”, the sort of arbitrary differences between different ways to organize data structures, functions and types (think: what everyone argues the most about).

If some functionality can basically only be expressed in one general way then this is not subject to copyright. This could either be due to performance constraints, simplicity, or something based on a mathematical algorithm.

It is incredibly hard to read through a codebase and not be inspired by the structure. “Oh, I like how this code was organized!”… it is hard to “forget” that! That’s the real reason for clean room design. To keep people from inadvertently using the “expressive, non-utilitarian structures” that ARE covered by copyright in works that are otherwise useful and subject to mechanical inventions covered by patents.

Yes, and if you ask an AI to reimplement functionally for which it has previously seen an exact implementation, it will likely spit out something that looks heavily influenced, if not identical, to the original. That's the thing about ML - it's still just trying to make the result match the query and doesn't really "care" about the originality of the response.
First, these code language models are best at the utilitarian aspects of programming and worst at the expressive aspects. At this point Copilot is going lead you down pretty awful paths towards organizing and structuring your code.

Second, when these language models improve to the point where they can assist in organizing and structuring code, this is where a given individual will be most prone to disagreeing with the model and having individual preferences for data types and whatnot!

Third, language models work best when they copy all works indiscriminately. Copyright is an individual right granted to a human author. Infringements are against specific human author's rights to their specific individual works, not all human authors and their unrealized collective works! The courts might find that the language models are not infringements but that individuals who USE these tools can indeed be found infringing on specific works! This would mean that these tools are putting the liability entirely on the user of the tools. I would expect that having these tools notify users that they are likely in violation of some kind of copyright claim would be very useful, a sort of "fingerprinting" of the abstract individual expressions of code structuring. This would probably expand on the growing notion of prior art in copyright and probably expose a lot of coincidental, "clean" patterns that should not be covered by copyright due to the frequency at which people tend to express in that given manner... like, the Factory Pattern or something like that.

The difficulty here is its not clear that a user (Human or otherwise) has explicitly agreed to a license when they read content, including published code, on the internet.

Even with that in a license file on a public repository on GitHub or elsewhere I don't think it is currently legally binding under the current understanding of the law. We need test cases to go though the courts and potentially amendments to copyright law to make it clear that a license can restrict a user's privet use of content that is on the public internet.

Alternatively it needs to be found under copyright law that a ML model infringers on the content it is trained upon, unless it is licensed for that use case. Again that is not the current understanding of current copyright law.

Point is, a license with that in it probably does nothing to help, currently.

There are some issue with this.

SEO is basically tricking machine learning models to rank you higher. And if someone posted spam, this clause shall and will not prevent it from being labelled as spam in service provider's training model.

I can see a case for generative model to be prohibited specifically, but commercial software would avoid adapting those data in their code in the first regardless, as it is contaminative.

In the end, I think the bigger question is still, what would forbidding machine learning models to train on your data is for? Why is it bad? There is already enough data out there that is free to be trained on, so if this is going to prevent those models from continuing existing, I doubt it will work.

Absolutely not. AI code generation is such a revolutionary boon already, the full potential of which has yet to be seen, that this kind of restriction would be unconscionable. Legally I hope it's found that AI needs no more permission to simply learn from something than a human does. In general I loathe the prospect od petty squabbling over warped perception over ownership of ideas getting in the way of progress.
What do copyright holders receive in return? I can get behind the appeal for technological advancement, but the only way this argument can be made in good faith is if we also mandate that any AI model trained on publicly available data be open sourced and released into public domain.

Maybe this would be an appropriate time to talk about abolishing copyright altogether rather than entertaining the idea of making exceptions for AI, i.e. tech oligopolies.