| HN Mirror


by rockemsockem 1158 days ago

I'm not sure if most data that these models are trained on is copyrighted, but I feel pretty safe saying that a majority of data that human beings have created is copyrighted. Think movies, books written recently, every website that isn't explicitly "creative commons" or something similar, code that isn't permissively licensed, etc.

We definitely need clarification, but however long the first court case takes, there will be an appeal, and then probably several more. So I'm afraid we're going to be living in limbo for at least a decade, which is sort of an answer in and of itself: by that time, services like this will have become pervasive and will have been integrated into lots of workflows across the planet.

It seems to me that training on MS proprietary code is perfectly legal, but how you acquire that code is probably important. If you are able to decompile the code from your Windows machine and use it for training, then that looks A-OK, but if you use Microsoft code that was leaked as part of a hack, then maybe that's a different story, since you're in possession of stolen property.

1 comment

simion314 1158 days ago

Windows code was released to some researchers and was also leaked.

IMO a well-trained NN would not violate copyright, but even if the courts decide the opposite we are not screwed: you can train open models using open source code and permissively licensed text and art. Microsoft could try to buy some data sets, but they would not be able to mix GPL/MIT-like content into their proprietary models, so the open models would win, IMO.

The open source community is already working on creating data sets for training, so this will grow the way open source code grew in the past. We just need a bit of time for this software to get more efficient or for people to get better hardware.