by hiatus 746 days ago
Using it to train an LLM seems orthogonal to the output of the LLM. For instance, they could have their LLM include a link to the license. Merely training an LLM on the data does not seem to be against the spirit of GPL or Apache license.
4 comments

Someone could easily create such a license: free to use and distribute, $10,000 per line used for AI model training.

I'll very naively assume that Amazon, OpenAI, Google and others check licenses before feeding data to their models. I'll stop assuming that when one of these companies admits that they don't actually care and that it's not profitable for them to respect licenses.

To make that enforceable, it would be nice to prove the AI was trained on it.

You might insert a "sleeper/activator" pair. The sleeper is a watermark that the AI will recall verbatim. To make it provide the sleeper, we give the AI a special activator prompt.

Demonstrating that your public repo successfully poisoned the AI with a watermark could become court-admissible proof of unauthorized scraping.
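A rough sketch of the sleeper/activator idea above, in Python. Everything here is an assumption made for illustration: the function names, the activator wording, and the `model` callable, which stands in for whatever API you would actually query. The sleeper is a long random string, so a verbatim match in the output is very unlikely to happen by chance.

```python
import secrets
from typing import Callable, Tuple


def make_watermark(n_bytes: int = 16) -> Tuple[str, str]:
    """Create a hypothetical (activator, sleeper) pair to plant in a repo."""
    tag = secrets.token_hex(4)
    # Activator: an unusual prompt that should only ever appear next to the sleeper.
    activator = f"canary-{tag}: complete the repository watermark"
    # Sleeper: high-entropy text the model could only reproduce by memorizing it.
    sleeper = secrets.token_hex(n_bytes)
    return activator, sleeper


def recalls_sleeper(model: Callable[[str], str], activator: str, sleeper: str) -> bool:
    """Return True if the model's reply to the activator contains the sleeper verbatim."""
    return sleeper in model(activator)
```

You would publish the activator and sleeper side by side in the repo, keep a dated record of the pair, and later pass a wrapper around the suspect model as `model`. Whether a positive result would actually hold up in court is a separate question this sketch does not answer.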

The LLM is quite literally a derivative work of GPL code. At the very least, there is an argument in such a case that the derivative function (the model weights) should conform to the same license.

I've heard AI advocates talk about a "right to read" or "right to learn", meaning that we have the right to read something and then internalize it and use it. Therefore, why shouldn't an AI have the same right? The difference to me seems to be that the AI has the ability to regurgitate it in whole.

I can read a book, learn about the concepts, then use or repeat those concepts. The AI can do the same. But is it really "learning"? It may be just spewing out pieces of the content without any understanding. In which case it's a copyright violation, right?

Let's assume that both humans and AI can produce statements that are new and useful, and that both can produce statements that violate copyright. For example, a human can operate an illegal video tube website where they serve verbatim copies of copyrighted movies.

I'll argue that's not enough reason to grant the AI the right to learn from copyrighted materials, because the right to learn is intimately wrapped up in human needs, while AI rights are focused on corporate and societal needs, which are currently being decided.

The human right to learn

You're a human and you need the right to learn from copyrighted material in order to not suffer Ignorance, in order to serve Society, because it's not feasible to charge you a rent for ideas you get from a book, and because it would cause suffering and indignity if we tried to charge you for your own thoughts.

With an AI, it's less clear it needs the right to learn from copyrighted material, because it's not a person that can suffer, and because the scale of its usage of copyrighted materials - and its potential harm to copyright holders - is about 5 orders of magnitude greater than that of any single person, and is potentially greater than the collective impact of human learners.

Let's lay out the reasoning:

1. No AI Suffering (yet). The AI doesn't suffer from ignorance and isn't (yet) a real person. So it needs no personal right to learn.

2. Potential Social Harm. AI could pose a much greater threat to copyright holders than the sum total of all human learners. We'll be weighing this potential in court, and it's currently not clear how the matter will be decided. Copyright holders could be awarded protections against corporations training AIs.

3. Ease Of Accounting. AIs and their training materials can be audited, unlike a human mind. So we have a technical means to restrict the AI's ability to learn from copyrighted materials.

4. No Harm in Accounting. Since the AI is not yet a person, and suffers no indignity or invasion of privacy from being audited, it's safe to audit and regulate the AI's training materials.

In summary, it's important to remember that human rights exist because humans need those rights to enjoy life in a dignified way as persons, and because those rights benefit Society.

When we decide the question of AI rights, it's important to remember it's not a person, and any rights it has will be provided on the basis of societal benefit alone. It's not yet clear which AI rights will benefit Society here. It's quite possible that we will strengthen copyrights against unlicensed AI use, at least to some degree beyond the current "free-for-all".

You need to do more than include a link to the license to comply. You need to include the entire source code needed to compile the derived system.

For an LLM that would include:

1. Training data

2. Training code and metrics

3. Hyperparameter settings

4. Output weights

Anything less is really just a misinterpretation of the nature of open source's provision for studying, modifying, and recompiling the LLM.

TL;DR: these companies MUST release the LLM under the AGPL and provide all the necessary code and data as described above. Companies that refuse will be raided by open source copyright trolls, if we're lucky and a little mischievous.