| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by lofatdairy 1337 days ago
	GP is just highlighting why this is so common and often a challenging edge case. If you ask it for something that's exactly in its dataset, the "best" solution that minimizes loss will be that existing code. Thus, it's somewhat intrinsic to applying statistical learning to text completion. This means MS really shouldn't have used copyleft code at all, and really shouldn't be selling copilot in this state, but "luckily" for them, short of a class action suit I don't really see any recourse for the programmers who's work they're reselling.

2 comments

kmeisthax 1336 days ago

Suing Microsoft for training Copilot on your code would require jumping over the same hurdle that the Authors Guild could not: i.e. that it is fair use to scan a massive corpus[0] of texts (or images) in order to search through them.

My real worry is downstream infringement risk, since fair use is non-transitive. Microsoft can legally provide you a code generator AI, but you cannot legally use regurgitated training set output[1]. GitHub Copilot is creating all sorts of opportunities to put your project in legal jeopardy and Microsoft is being kind of irresponsible with how they market it.

[0] Note that we're assuming published work. Doing the exact same thing Microsoft did, but on unpublished work (say, for irony's sake, the NT kernel source code) might actually not be fair use.

[1] This may give rise to some novel inducement claims, but the irony of anyone in the FOSS community relying on MGM v. Grokster to enforce the GPL is palpable.

link

fweimer 1337 days ago

Pretty much all code they have requires attribution, and based on reports, Copilot does not generate that along with the code. So excluding copyleft code (how would you even do that?) does not address the issue (assuming that the source code produced is actually a derivative work).

link

lofatdairy 1336 days ago

That's a good point. I was thinking that during the curation phase of the dataset they should check for a LICENSE.txt file in the repo, and just batch exclude all copyleft/copyright containing repositories. This obviously won't handle every case as you say, and when it does generate copyleft code it will fail to attribute, but hopefully not having copyleft code in its dataset or less of it reduces the chance it generates code that perfectly satisfies its loss function by being exactly like something its seen before.

The main problem I see with generating attribution is that the algorithm obviously doesn't "know" that it's generating identical code. Even in the original twitter post, the algorithm makes subtle and essentially semantically synonymous changes (like the changing the commenting style). So for all intents and purposes it can't attribute the function because it doesn't know _where_ it's coming from and copied code is indistinguishable from de novo code. Copilot will probably never be able to attribute code short of exhaustively checking the outputs using some symbolical approach against a database of copyleft/copyrighted code.

link