| HN Mirror



	by saagarjha 2751 days ago
	> IntelliCode uses machine learning to train over thousands of real-world projects including open-source projects on GitHub.

	Is it legal to use a bunch of open-source projects in what looks like a proprietary plugin?

1 comments

Hydraulix989 2751 days ago

Depends on the licenses of said projects. It's pretty easy to whitelist MIT/Apache licensed code.
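As a rough illustration of the whitelisting step, here is a minimal sketch in Python. It assumes each project is described by a hypothetical metadata dict with a `license` field holding an SPDX identifier; the field name, the `ALLOWED_LICENSES` set, and the sample repositories are all made up for the example, and real license detection (missing metadata, dual licensing, vendored code) would need more care.

```python
# Hypothetical allowlist of SPDX license identifiers (an assumption for this sketch).
ALLOWED_LICENSES = {"MIT", "Apache-2.0"}


def filter_by_license(repos):
    """Keep only repos whose declared license is on the allowlist.

    Repos with no declared license are dropped rather than assumed permissive.
    """
    return [repo for repo in repos if repo.get("license") in ALLOWED_LICENSES]


# Made-up sample metadata, not real projects.
sample_repos = [
    {"name": "alpha", "license": "MIT"},
    {"name": "beta", "license": "GPL-3.0-only"},
    {"name": "gamma", "license": "Apache-2.0"},
    {"name": "delta", "license": None},
]

training_set = filter_by_license(sample_repos)
print([repo["name"] for repo in training_set])
```

Filtering like this only decides which code goes into the training set; it does not settle what the licenses require of the resulting model.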

Even still, it's unclear whether an abstract machine learning model that "saw" copyrighted IP during the training process is infringing on said IP. A self-driving car sees many billboards during test rides.

userbinator 2751 days ago

It's basically an extension of "everything is a derivative work". I believe it's similar to how you cannot (and should not be able to) erase your employees' brains of the code they worked on.

fulafel 2751 days ago

Those licenses still require you to display acknowledgements, copies of the original license, disclaimers etc.

WalterGR 2751 days ago

They require you to do that when?

When you include the source code in a project? Or distribute binaries created in whole or in part from the source code?

The licenses say exactly what the requirements are, and when you are obligated to follow those requirements. What do they say about training AI models on the code?

groceryheist 2750 days ago

> Distribute binaries created in whole or in part from the source code.

How exactly is that different from what's happening here?

haneefmubarak 2750 days ago

Think of this more along the lines of when you read a code description of how something works and then describe it to someone else. In the same way that you don't have to provide copies of the license(s) of whatever you read, you shouldn't have to do that here, because the model is simply learning and then inferring.

IshKebab 2750 days ago

It's not as simple as that though. What if I train an AI to output a book given its title? With a sufficiently expressive and overfit network it could actually memorise the training data exactly.

That means my network now includes an exact copy of those books (encoded in a weird way but that is irrelevant). Should I be able to distribute my weights and ignore the copyright on the original books? Of course not!

In this case it is very unlikely that the network weights store much data from each individual GitHub project, but it's definitely a sliding scale.
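The memorisation point can be made concrete with a toy sketch. This is not a neural network: it swaps in a plain next-character lookup table keyed on the previous `order` characters, which is enough to show the sliding scale. The function names, the `END` marker, and the two made-up "books" are assumptions for illustration only. With a large `order` every context is unique and the "model" reproduces its training text verbatim from the title alone; with a small `order` contexts collide and it retains far less of any one text.

```python
END = "\0"  # end-of-text marker (arbitrary choice; texts must not contain it)


def train(corpus, order):
    """Build a table mapping the previous `order` characters to the next one.

    `corpus` maps title -> text; `order` must be at least 1.
    Later entries overwrite earlier ones when contexts collide.
    """
    table = {}
    for title, text in corpus.items():
        seq = title + "\n" + text + END
        for i in range(len(title) + 1, len(seq)):
            context = seq[max(0, i - order):i]
            table[context] = seq[i]
    return table


def generate(table, title, order, max_len=10_000):
    """Emit text for a title by repeatedly looking up the next character."""
    seq = title + "\n"
    start = len(seq)
    while len(seq) - start < max_len:
        nxt = table.get(seq[-order:])
        if nxt is None or nxt == END:
            break
        seq += nxt
    return seq[start:]


# Made-up stand-ins for copyrighted books.
books = {
    "Book One": "It was a dark and stormy night.",
    "Book Two": "It was a bright and sunny day.",
}

overfit = train(books, order=64)
print(generate(overfit, "Book One", order=64) == books["Book One"])
```

In the high-order case the table is an exact copy of the books in an unusual encoding, which is the situation described above; how much lower the capacity has to be before that stops being true is the open question.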
