Hacker News
by Hydraulix989 2751 days ago
Depends on the licenses of said projects. It's pretty easy to whitelist MIT/Apache-licensed code.
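
To make the whitelisting point concrete, here is a minimal sketch of a license filter. The repo records, the `license` field, and the `filter_by_license` helper are all hypothetical; it assumes each project already carries a declared SPDX-style license identifier, which real-world metadata often does not.

```python
# Hypothetical allowlist of SPDX-style license identifiers.
ALLOWED_LICENSES = {"MIT", "Apache-2.0"}


def filter_by_license(repos):
    """Keep only repos whose declared license is on the allowlist."""
    return [r for r in repos if r.get("license") in ALLOWED_LICENSES]


# Made-up records for illustration; not real projects.
repos = [
    {"name": "alpha", "license": "MIT"},
    {"name": "beta", "license": "GPL-3.0-only"},
    {"name": "gamma", "license": "Apache-2.0"},
    {"name": "delta", "license": None},  # no declared license: excluded
]

kept = filter_by_license(repos)
print([r["name"] for r in kept])
```

Note that this only decides what goes into the training set. It says nothing about the attribution requirements those licenses carry.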

Even still, it's unclear whether an abstract machine learning model that "saw" copyrighted IP during the training process is infringing on said IP. A self-driving car sees many billboards during test rides.

It's basically an extension of "everything is a derivative work". I believe it's similar to how you cannot (and should not be able to) erase the code they worked on from your employees' brains.
Those licenses still require you to display acknowledgements, copies of the original license, disclaimers, etc.
They require you to do that when?

When you include the source code in a project? Or distribute binaries created in whole or in part from the source code?

The licenses say exactly what the requirements are, and when you are obligated to follow those requirements. What do they say about training AI models on the code?

> Distribute binaries created in whole or in part from the source code.

How exactly is that different from what's happening here?

Think of this more along the lines of when you read a code description of how something works and then describe it to someone else. In the same way that you don't have to provide copies of the license(s) of whatever you read, you shouldn't have to do that here, because the model is simply learning and then inferring.
It's not as simple as that, though. What if I train an AI to output a book given its title? With a sufficiently expressive and overfit network, it could actually memorise the training data exactly.

That means my network now includes an exact copy of those books (encoded in a weird way but that is irrelevant). Should I be able to distribute my weights and ignore the copyright on the original books? Of course not!
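
A toy sketch of that extreme case, assuming the simplest possible "network": one row of weights per title (a one-hot input just selects a row), trained by gradient descent on squared error against the text's bytes. The `train` and `generate` names and the sample texts are made up for illustration. This is not how a real code model is built, just the degenerate overfit end of the scale.

```python
def train(books, epochs=60, lr=0.5):
    """Fit one weight row per title to the bytes of its text."""
    width = max(len(text.encode("utf-8")) for text in books.values())
    weights = {title: [0.0] * width for title in books}
    for _ in range(epochs):
        for title, text in books.items():
            target = list(text.encode("utf-8"))
            target += [0] * (width - len(target))  # zero-pad short texts
            row = weights[title]
            for j, y in enumerate(target):
                # Gradient step on 0.5 * (w - y)**2 for each weight.
                row[j] -= lr * (row[j] - y)
    return weights


def generate(weights, title):
    """Decode a title's weight row back into text."""
    data = bytes(round(w) for w in weights[title])
    return data.rstrip(b"\x00").decode("utf-8")


# Made-up "books" for illustration.
books = {
    "Book A": "first toy text",
    "Book B": "another, somewhat longer toy text",
}

weights = train(books)
print(generate(weights, "Book A"))
```

After training, the weights alone are enough to reproduce each text verbatim, so distributing them amounts to distributing the texts in an odd encoding.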

In this case it is very unlikely that the network weights store much data from each individual GitHub project, but it's definitely a sliding scale.