by zaarn 1775 days ago
If the generated code turns out not to infringe the IP of the code used to train it, that would be a big win for open source / free software.

Now you can just grab any leaked code of a closed source program, feed it into your AI and get back code you can license under the GPL and nobody can do anything about it.

An easy application I can think of is ZFS: simply feed the AI all the CDDL-licensed code, then ask it to reproduce ZFS. It will probably have some bugs, but it would be licensable under the GPL if the AI is considered a white room.


I think you’re missing that the law considers intent. If the devs of Copilot were not trying to set up infringement, then their algorithm’s output is likely not considered infringement [1]. However, if you set out to “launder” copyrighted material, then the law will take that into consideration and likely find that you violated copyright. This intent can be demonstrated in court either via your statements or via your actions (such as constructing a meaninglessly tiny training set).

[1]: https://ilr.law.uiowa.edu/print/volume-101-issue-2/copyright...

Would it not be categorically intended as infringement regardless of the copyright status of the material?

It seems to me that the licensing part is the part you can't throw into a big Markov chain, legally. Even if they aimed only at open-source licensed material without exception, the point where they discard all the licenses and export a 'generic' slurry is the point where they infringe by definition. If they trained on more restrictive licenses, that's just doubling down: what's needed is annotation and maintenance of which bits of code came from which licensing pool. You could well have a giant pool of GPL and a giant pool of MIT (which I would be in, all the more since I maintain a very automatable code style that's easy to import from). You could accumulate a list of sources for anything you did, at whatever level of granularity is desired.
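
As a rough illustration of the per-pool bookkeeping described above, here is a minimal Python sketch. Everything in it (the `Snippet` record, the `build_pools` and `sources_for` helpers, the example paths and code strings) is hypothetical and made up for illustration; it is not how Copilot or any real tool works, and the verbatim substring match is only a stand-in for whatever granularity of attribution you would actually want.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """One unit of training code, kept together with its provenance."""
    source: str   # hypothetical repo/path identifier
    license: str  # e.g. "GPL-3.0", "MIT", "CDDL-1.0"
    text: str


def build_pools(snippets):
    """Group snippets by license so each pool can be used or excluded separately."""
    pools = defaultdict(list)
    for s in snippets:
        pools[s.license].append(s)
    return dict(pools)


def sources_for(output_lines, snippets):
    """Naive attribution: every (source, license) whose text contains an output line verbatim."""
    hits = set()
    for line in output_lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines would match everything, so skip them
        for s in snippets:
            if stripped in s.text:
                hits.add((s.source, s.license))
    return sorted(hits)


# Made-up corpus: two tiny snippets from two different license pools.
corpus = [
    Snippet("example/gpl-project/util.c", "GPL-3.0",
            "int add(int a, int b) { return a + b; }"),
    Snippet("example/mit-project/util.c", "MIT",
            "int sub(int a, int b) { return a - b; }"),
]

pools = build_pools(corpus)
attribution = sources_for(["int add(int a, int b) { return a + b; }"], corpus)
```

With the license kept next to every snippet, a generator could in principle be restricted to a single pool, or its output could be checked against the pools after the fact. Discarding the `license` field is exactly the step that makes either impossible.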

The purpose of throwing away this attribution is intent to infringe. It's constructing a machine for the explicit purpose of grinding code into a sludge of pieces intentionally small enough that, if you reconstruct copyrighted code in your Markov-chainy way, you've got grounds for pretending you didn't build your machine to do exactly that.

> you've got grounds for pretending you didn't build your machine to do exactly that.

I believe all laws about intent have to deal with determining who is pretending and who isn't. But these laws still exist, because there are ways to prove such things.

I don't think it's that easy. A tiny training set would obviously defeat the point, but the other part is that the AI can't commit copyright infringement, and I don't have to ask it to produce anything. I merely fed it copyrighted code and released it to other developers without documenting that fact. I could possibly open source the entire bot, as no part of the AI would be under the restrictions of the training set.
Again, the law isn’t enforced by robots and is able to adapt such that “clever legal hacks” don’t typically work. Us programming nerds tend to think in terms of rigid, unambiguous rules that treat inputs as black boxes, but the law does not work like this.
I'm well aware, but I don't think this is an issue here.
It's exactly the issue.

If the AI could be shown to have copied the code, it would likely be found to be infringement.

If it was found to have generated new, unique code, and merely learnt how to program from the code it was trained on, it likely wouldn't.

In either case, this is different from a clean-room implementation (which I think is what you meant by "white room").

Clean-room implementations are supposed to protect against trade secret infringement, and are mostly used when building interop with hardware (where compatibility has special carve-outs).

If a person or AI had seen the copyrighted code used in the project, it would never be considered clean room.

But CDDL code is fine for a person or AI to learn from when building a new, incompatible implementation that doesn't share any code.

If you hire a programmer who has worked on said closed source and ask them to recreate that code in your GPL-licensed program, at what point will that be considered infringing on the original? Can you relicense code if you feed it through a programmer?
A programmer is a living breathing human, a legal entity that can hold copyright and is capable of violating copyright.

An AI, however, cannot hold copyright and isn't capable of violating copyright (only legal entities are, and an AI is not one).

It’s called a transpiler; you don’t need AI for this, but it’s obviously still licensed the same as the original, because it is the original, only translated.
I'm not talking about a transpiler. I'm talking about feeding massive amounts of non-GPL code into an AI and then asking it to produce new code based on that. A transpiler would simply take a single codebase and translate it into a new format, the obvious difference being that such a tool has an obvious and introspectable transformation function.
If the resulting code works the same way, it’s still a transpiler. If the resulting code works in a different way… then I have to ask what exactly does it do and how’s that supposed to be useful.
There should be no reason for CDDL code not to be includable under the GPL license. Canonical does include it, for example.
This would eliminate any ambiguity; ZFS would simply be GPL now.
If you replaced CDDL with GPL you’d lose patent protection. Good luck with Oracle lawyers.
Most of the relevant ZFS patents have already expired, so I don't think there is anything for the Oracle lawyers. Plus, I live in a country where software patents aren't recognized, so double good luck to Oracle.
On some level yes. This is basically shifting closed repos to a trade secret status.