by NoraCodes 174 days ago
You - and many other commenters in this thread - misunderstand the legal theory under which AI companies operate. In their view, training their models is allowed under fair use, which means it does not trigger copyright-based licenses at all. You cannot dissuade them with a license.
5 comments

While I think OP is shortsighted in their desire for an “open source only for permitted use cases” license, it is entirely possible that training will be found to not be fair use, and/or that making and retaining copies for training purposes is not fair use.

Perhaps you can’t dissuade AI companies today, but it is possible that the courts will do so in the future.

But honestly it’s hard for me to care. I do not think the world would be better if “open source except for militaries” or “open source except for people who eat meat” licenses became commonplace.

The problem is "viral" licences. Must the code generated by an AI trained with GPL code be released with a GPL licence?

Also, can an AI be trained with the leaked source of Windows(R)(C)(TM)?

> Also, can an AI be trained with the leaked source of Windows(R)(C)(TM)?

I think you mean to ask the question "what are the consequences of such extreme and gross violations of copyright?"

Because they've already done it. The question is now only ... what is the punishment, if any? The GPL requires that all materials used to produce a derivative work that is published, made available, performed, etc. be made available at cost.

Does anyone who has a patch in the Linux kernel and can get ChatGPT to reproduce their patch (i.e. every Linux kernel contributor) get access to all of OpenAI's training materials? Ditto for Anthropic, Alphabet, ...

As people keep pointing out when defending copyright here: these AI training companies consciously chose to include that data, at the cost of respecting the "contract" that is the license.

And if they don't have to respect licenses, then what if I run old Disney movies through a matrix (let's say the identity matrix) and publish the results? How about 3 matrices with some nonlinearities? Where is the limit?

Since copyright law cannot be retroactively changed, any update Congress makes to copyright wouldn't affect the outcome for at least a year ...

Open source except for people who have downvoted any of my comments.

I agree with you though. I get sad when I see people abuse the Commons that everyone contributes to, and I understand that some people want to stop contributing to the Commons when they see that. I just disagree - we benefit more from a flourishing Commons, even if there are freeloaders, even if there are exploiters etc.

Of course, if the code wasn't available in the first place, the AI wouldn't be able to read it.

It wouldn't qualify as "open source", but I wonder if OP could have some sort of EULA (or maybe it would be considered an NDA). Something to the effect of "by reading this source code, you agree not to use it as training data for any AI system or model."

And then something to make it viral. "You further agree not to allow others to read or redistribute this source code unless they agree to the same terms."

My understanding is that you can have such an agreement (basically a kind of NDA) -- but if courts ruled that AI training is fair use, it could never be a copyright violation, only a violation of that contract. Contract violations can only receive economic damages, not the massive statutory penalties that copyright violations do.

Having a license that specifically disallows a legally dubious behavior could make lawsuits much easier in the future, however. (And might also incentivize lawyers to recommend avoiding this code for LLM training in the first place.)

People think that code is loaded into a model, like a massive available array of "copy+paste" snippets.

It's understandable that people think this, but it is incorrect.

As an aside, Anthropic's training was ruled fair use, except for the books they pirated.

Fair use is a defense to copyright violation, but highly dependent on the circumstances in which it happens. There certainly is no blanket "fair use for AI everything".