| HN Mirror


	by toastal 1748 days ago
	You are the free labor copilot to train Microsoft GitHub's Copilot tool. You are responsible for any of those insecure code errors and for the diligence required. You will be on the hook for resulting problems. But Microsoft and their home-phoning, tracking-embedded editor will get real people to correct and train their machine for free, with their stated plan of selling that machine back to us later. I wish there were a “robots.txt” file for Git to disallow certain bots from training on anything I have written.

4 comments

carl_dr 1748 days ago

> I wish there were a “robots.txt” file for Git to disallow certain bots from training on anything I have written.

It’s simple. If you are concerned by this, don’t host your repositories on GitHub.

tyingq 1748 days ago

You would have to not host your code publicly either, right?

HomeDeLaPot 1748 days ago

Merely hosting your code publicly seems like it wouldn't give GitHub the right to train AI models on it. You could even say it's against your terms of use. And to do it, they would have to go out of their way to find your repo on the web and clone it—unlikely.

My impression (NOT A LAWYER) is that by hosting your code in a public repo on GitHub, you agree to their terms and give them the right to "read" your code including training AI models on it. Or at least that's what they're banking on.

Go host on Sourcehut or self-host with Gitea, and I would think it unlikely (but not impossible) that any big company would use your code to train their AI.

hnfong 1748 days ago

It's not even very clear whether training an AI on OSS code is a violation of those licenses. So unless you publish your code under a proprietary license that clearly rejects such use, you can't really prevent people from doing that anyway.

Just imagine, there's really nothing preventing people from scraping your blog to train their natural language processing AI or whatever, why would code be any different? Even if you put up a big sign saying you don't consent to having your data ingested by a neural network, I doubt it will get noticed anyway...

People have been taking large OSS codebases (e.g. the Linux kernel) for various statistical analyses. AI is just doing the same thing in a more sophisticated manner.

tyingq 1747 days ago

I bet if I trained an AI on some vocalist and released an album I'd get some legal mayhem. I do concede it might go differently for code, but none of these issues are crystal clear for me.

toastal 1748 days ago

I wish it were easier to convince the projects I like and want to help to migrate, for the same reason. Committing to their repos does not put me in the clear, and that includes repos that are mere mirrors.

sobellian 1748 days ago

I would think that training a NN falls squarely in fair use.

nextlevelwizard 1748 days ago

You can always host it with a license that doesn't allow reuse or something.

tyingq 1748 days ago

GitHub mentions that they don't currently look at the license before trawling code.

https://twitter.com/NoraDotCodes/status/1412741339771461635

There are also other references suggesting that GitHub public repos weren't the only source. They trawl other publicly readable code.

nextlevelwizard 1747 days ago

You can sue them for using your code if they break the licensing agreement. Contact EFF and they'll set you up with a lawyer.

andix 1748 days ago

It is called LICENSE.txt. License your code as GPL and then Copilot can't reproduce bigger parts of your code.

But as long as you give the public access to your code, they can study it and learn from it. Humans and machines.

chrismorgan 1748 days ago

No, the license that you apply is completely irrelevant, and there’s certainly nothing whatsoever special about the GPL. Copilot is completely depending on being effectively exempt from copyright; if that legal theory falls apart, the entire space (and a lot of other machine learning stuff) is utterly doomed. Trouble is, Copilot can’t tell whether it’s reproducing copyrightable chunks of your code, or indeed where what it produces came from, by the very nature of machine learning techniques.

JamesSwift 1748 days ago

They could easily tag the source with license info and take that information into account when feeding data in.
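
A minimal sketch of what that tagging could look like, assuming each file in the corpus already carries an SPDX-style license identifier. The `SourceFile` type, the `filter_by_license` helper, and the allow-list are all hypothetical, made up for illustration; nothing here reflects how Copilot's pipeline actually works.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Hypothetical allow-list: which licenses a trainer would accept is an assumption.
ALLOWED_LICENSES: Set[str] = {"MIT", "Apache-2.0", "BSD-3-Clause"}


@dataclass(frozen=True)
class SourceFile:
    """One file in a training corpus, tagged with its repo's license."""
    repo: str
    path: str
    license_id: Optional[str]  # SPDX-style identifier, or None if unknown
    text: str


def filter_by_license(
    files: Iterable[SourceFile],
    allowed: Set[str] = ALLOWED_LICENSES,
) -> List[SourceFile]:
    """Keep only files whose license tag is on the allow-list.

    Files with an unknown license (None) are dropped as well.
    """
    return [f for f in files if f.license_id in allowed]
```

This only controls what goes in. It says nothing about where a given suggestion came from once the model is trained.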

chrismorgan 1747 days ago

That’s not how learning, human or machine, works. Learning is about collecting all kinds of stuff from diverse sources into a great melting-pot, so that you can form something new out of it—but you can’t generally identify where everything comes from. Individual recognisable tricks perhaps, but if you want to say “this code was inspired by X, Y and Z”, well, that inspiration is typically everything, the entire corpus.

wizzwizz4 1748 days ago

It could, actually, if it were augmented with the ability to do so – but that would be a bit more expensive.

gnomewascool 1748 days ago

I don't think that the GPL gives much more protection than any other FOSS license, here, in practice.

If Copilot were to reproduce a larger part of, say, an MIT-licensed codebase or almost any other permissive licence, then they should legally provide attribution. I'm pretty sure that they don't even have an option to provide such specific attribution, which means that either they believe that the code copied from any one source is below the relevant threshold or they're just ignoring copyright.

JamesSwift 1748 days ago

I would assume GitHub could supersede your license by putting its own claim to your code in the TOS. I doubt they have done that, but just pointing it out.

andix 1748 days ago

I don’t think that’s possible, as long as you don’t actively accept that. Nobody can claim your copyright without your approval.

It would also be the end of GitHub, as most users probably won’t accept such terms.

0-_-0 1748 days ago

According to many people familiar with the legal aspect, training on code constitutes fair use, so it can't be prevented by any kind of license.

andix 1748 days ago

Training, exactly. But the trained person or AI is not allowed to reproduce your exact code. But Copilot seems to do that from time to time.

sobellian 1748 days ago

Sure they can, search engines produce copyrighted material all the time. The issue comes in when people think this somehow indemnifies them as users of Copilot - my guess is, it doesn't protect you any more than if you use a search engine to copy an entire codebase for your own purposes.

JamesSwift 1748 days ago

I don't disagree with users not accepting the terms, just pointing out that license text doesn't trump everything.

EamonnMR 1748 days ago

I'd love to see ML-GPL which specifically deals with using licensed property as a training set.

chrismorgan 1748 days ago

Not possible. Such licenses are founded upon copyright doctrine, and copyright doesn’t protect against learning, natural or machine. As it stands (and this can certainly change), legal consensus in general (regardless of jurisdiction) is that if you publish your code where they can reach it, they can use it.

rustc 1748 days ago

So would it (theoretically) be legal to train on the JS files services like gmail.com serve to the client? What about decompiled output of proprietary software like certain files in Windows and macOS?

macksd 1748 days ago

Except for any laws or restrictions against decompiling, it would legally be no different than the GPL case. Although personally I think, since Copilot is capable of redistributing the code, the question of whether the GPL permits the specific usage is still unclear.

ziml77 1748 days ago

I would expect so, though given the limitations of decompilation (in the absence of debug info) I don't know how useful it would be.

toastal 1746 days ago

Or specifically no closed-source or closed-data tools. I wouldn't mind if a non-profit org wanted to help, but it's Microsoft—and they want to sell it back to us in the future.

danShumway 1748 days ago

:) I suppose you could always just add so much insecure code to your GitHub account that your expected value to Copilot is negative.

Although judging from the results of this test it kind of seems like for a lot of accounts that's already happened.

gfiorav 1748 days ago

Plus, GH doesn't care how you licensed your code. It will learn from it and produce licensed code.
