by jongjong 21 days ago
This is a great point. I think for coding, the wording of the MIT open source license makes it clear that copying and distributing the software is authorised on a small scale and it's very clear that the act of copying must involve a person.

It provides distribution and modification rights to "any person obtaining a copy of the software" and explicitly requires attribution for any significant parts.

Mass-ingesting the code with a script without any human even reading the licence is a very different kind of copying mechanism and there is no person involved... The contract was bypassed completely. A contract requires consent from both parties to be binding. When ingesting code into the AI training set, nobody even read the license. There was no agreement, neither explicit nor implicit... Because the consumer, a script, never read the contract for that specific project.

There was nobody present on either side when the copying occurred! It cannot possibly constitute an agreement between two parties.

3 comments

> I think for coding, the wording of the MIT open source license makes it clear that copying and distributing the software is authorised on a small scale and it's very clear that the act of copying must involve a person.

I agree with “must involve a person”. https://opensource.org/license/mit starts with (emphasis added) “Permission is hereby granted, free of charge, to any PERSON obtaining a copy of this software and associated documentation files (the “Software”)”.

That means it doesn’t give an LLM any rights. The way I see it, LLMs run (directly or indirectly) by a person can do stuff on their behalf, though, just as your CI pipeline can download and compile MIT-licensed software.

I definitely disagree with the “on a small scale” as the license continues (again, emphasis added) “to deal in the Software WITHOUT RESTRICTION, including WITHOUT LIMITATION the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software”.

The CI pipeline is different because for a module to end up as a dependency in the CI pipeline, it had to be explicitly selected by a person first to be included in the package file or manifest. There was intentionality and awareness that the software was included.

A person already pre-consented to the licenses of all the software which the pipeline downloaded. Big companies go through those dependency lists carefully already and remove those which do not meet their policies. This is a very intentional process.

> for a module to end up as a dependency in the CI pipeline, it had to be explicitly selected by a person first

I disagree. I think it’s entirely within the license to have your pipeline automatically pull in the latest version of a library, even if the new one happens to pull in a new MIT-licensed library (whether that’s a good idea and whether CI pipelines should, somehow, verify that code pulled in has an acceptable license are different discussions)
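For what it's worth, the license verification mentioned in that parenthetical does not need to be complicated. Here is a minimal sketch, assuming the pipeline can already produce a resolved dependency list with declared SPDX license identifiers. The `Dependency` type, the `ALLOWED_LICENSES` policy, and the package names are all hypothetical, not taken from any real tool:

```python
from dataclasses import dataclass

# Hypothetical policy: SPDX identifiers a team has pre-approved.
ALLOWED_LICENSES = {"MIT", "BSD-3-Clause", "Apache-2.0"}


@dataclass(frozen=True)
class Dependency:
    name: str
    version: str
    license: str  # SPDX identifier as declared by the package


def find_policy_violations(deps, allowed=ALLOWED_LICENSES):
    """Return the dependencies whose declared license is not on the allowlist."""
    return [d for d in deps if d.license not in allowed]


# Made-up resolved dependency list, including a newly pulled-in transitive one.
resolved = [
    Dependency("left-lib", "2.1.0", "MIT"),
    Dependency("new-transitive-lib", "0.3.1", "GPL-3.0-only"),
]

violations = find_policy_violations(resolved)
for dep in violations:
    print(f"blocked: {dep.name} {dep.version} ({dep.license})")
```

A check like this only trusts whatever license each package declares. Whether that counts as a person "consenting" is exactly the question being argued here.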

I also think it’s completely within the MIT license to tell an LLM that it can search for MIT-licensed libraries and use them without asking you.

This would be an extremely novel mechanism of copyright litigation and I doubt it would fly in an American court with its emphasis on highly individualized legal rights and obligations. And, if it did get accepted by the courts, that's halfway to an even crazier argument: that the MIT license only allows individual distribution to known parties; i.e. no hosting the code on a website or seeding it on BitTorrent, because that's not "small scale" and doesn't "involve a person".

You can only seed it on BitTorrent if it comes with the license which identifies the original author and acknowledges their copyrights over the code. Also, there is definitely an assumption that a human will read the license or at least implicitly consent to the terms before using or modifying the software. When ingested by AI, the author gets zero credit and no consent has taken place between any sentient beings on either side of the contract... Or at least none that are legally acknowledged as sentient or having legal rights.

And the thing is, you point out the easy out on this for similarly licensed code... a giant list of authors and contributors that may have code included in the generated output. It's a win/win for everyone. The original authors get their acknowledgement, and the AI company gets to bill the users of AI for all the tokens for that multi-gigabyte copyright disclosure file.

That's like saying you're not allowed to load the source code into an editor, because it's not a person. Or that you're not allowed to run a global search-replace on the entire code base, because it's a script and not a person.

But in this case, a human has awareness of what software they are copying or modifying and that's how the original software author receives credit. The contract requires some degree of human awareness to be valid. This is the critical difference.

Sorry, that's nonsense. There's human awareness when ingesting MIT code into an LLM too. In both cases it's a human that says $ execute-global-replace or $ ingest-into-llm.

Both operations require some degree of human awareness. What you appear to be saying is, a human can only use a limited algorithm to access this source code, not a sophisticated one. And where do you draw that line? Who should get to say what is too sophisticated?

Error: your algorithm is too sophisticated to proceed, please provide more human awareness, it's a critical difference.

If your LLM were to hack into Microsoft and steal the source code from an important project and inject it into your project without you being aware of it; wouldn't that make you liable if you then published it?

Unfortunately, there is no way to agree to the license of software you're using if you didn't read the license or if you're not even aware that you're using the software. This is what's happening at the training stage.

If you say that awareness doesn't matter, then it means you cannot stop AI from stealing any IP, open source or not.

I think the main issue with LLMs is that there is no mechanism to stop them from stealing. Thus they are guaranteed to infringe on copyright to some extent.

Also, beyond copying and copyright, there is another problem that LLMs are also infecting the logic and expertise built into the project. This is a completely novel mechanism and needs to be treated as separate under the law. Else it would be the end of all IP.

> I think the main issue with LLMs is that there is no mechanism to stop them from stealing.

Well, sure there is—for the people running them.

If you're building training data for an LLM, you only use data that a) is firmly in the public domain, or b) you have a clear and documented legal right to use.
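As a rough sketch of what that rule could look like in a data pipeline, assuming each candidate file already carries a provenance record: the `Candidate` type, the `rights_record` field, and the example paths are hypothetical names for illustration, and deciding what actually counts as public domain or as a documented right is the hard legal part that code cannot settle.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    path: str
    public_domain: bool           # (a) firmly in the public domain
    rights_record: Optional[str]  # (b) pointer to the documented right to use, if any


def is_eligible(c: Candidate) -> bool:
    """Apply the a-or-b rule: public domain, or a documented right to use."""
    return c.public_domain or bool(c.rights_record)


def split_corpus(candidates: List[Candidate]) -> Tuple[List[Candidate], List[Candidate]]:
    """Separate candidates into (kept, dropped) so exclusions stay auditable."""
    kept = [c for c in candidates if is_eligible(c)]
    dropped = [c for c in candidates if not is_eligible(c)]
    return kept, dropped


# Made-up candidates for illustration only.
candidates = [
    Candidate("old_almanac.txt", public_domain=True, rights_record=None),
    Candidate("vendor_sdk/util.c", public_domain=False, rights_record="contract-2023-17"),
    Candidate("scraped/repo/main.py", public_domain=False, rights_record=None),
]

kept, dropped = split_corpus(candidates)
print("kept:", [c.path for c in kept])
print("dropped:", [c.path for c in dropped])
```

Anything with no provenance at all is dropped by default, which is the conservative reading of the rule.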