Hacker News
by toastal 1748 days ago
You are the free labor copilot to train Microsoft GitHub's Copilot tool. You are responsible for any of those insecure code errors and the diligence required. You will be on the hook for resulting problems. But Microsoft and their home-phoning, tracking-embedded editor will get real people to correct and train their machine for free, with their stated plan of selling that machine back to us later.

I wish there were a “robots.txt” file for Git to disallow certain bots from training on anything I have written.
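
For comparison, here is a rough sketch of how the existing robots.txt convention expresses that kind of per-bot opt-out for web crawlers, using Python's standard `urllib.robotparser`. The bot name `ExampleTrainingBot`, the URL, and the `may_fetch` helper are made up for illustration; nothing here implies that Git hosts or training crawlers honor such a file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one named bot, allow everyone else.
RULES = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, rules: str = RULES) -> bool:
    """Return True if the robots.txt-style rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    repo_url = "https://example.org/user/repo.git"  # made-up URL
    print(may_fetch("ExampleTrainingBot", repo_url))
    print(may_fetch("SomeOtherBot", repo_url))
```

Like robots.txt itself, this is purely advisory: it only works if the crawler chooses to check the rules before fetching.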

4 comments

> I wish there were a “robots.txt” file for Git to disallow certain bots from training on anything I have written.

It’s simple. If you are concerned by this, don’t host your repositories on GitHub.

You would have to not host your code publicly either, right?
Merely hosting your code publicly seems like it wouldn't give GitHub the right to train AI models on it. You could even say it's against your terms of use. And to do it, they would have to go out of their way to find your repo on the web and clone it—unlikely.

My impression (NOT A LAWYER) is that by hosting your code in a public repo on GitHub, you agree to their terms and give them the right to "read" your code including training AI models on it. Or at least that's what they're banking on.

Go host on Sourcehut or self-host with Gitea, and I would think it unlikely (but not impossible) that any big company would use your code to train their AI.

It's not even very clear whether training an AI on OSS code is a violation of those licenses. So unless you make your code public clearly under a proprietary license that clearly rejects such use, you can't really prevent people from doing that anyway.

Just imagine, there's really nothing preventing people from scraping your blog to train their natural language processing AI or whatever, why would code be any different? Even if you put up a big sign saying you don't consent to having your data ingested by a neural network, I doubt it will get noticed anyway...

People have been taking large OSS codebases (e.g., the Linux kernel) for various statistical analyses. AI is just doing the same thing in a more sophisticated manner.

I bet if I trained an AI on some vocalist and released an album I'd get some legal mayhem. I do concede it might go differently for code, but none of these issues are crystal clear for me.
I wish it were easier to convince projects I like and want to help to migrate, for the same reason. Committing to their repos, including mere mirrors, does not put me in the clear.
I would think that training a NN falls squarely in fair use.
You can always host it with a license that doesn't allow reuse or something.
GitHub mentions that they don't currently look at the license before trawling code.

https://twitter.com/NoraDotCodes/status/1412741339771461635

There are also other references indicating that GitHub public repos weren't the only source. They trawl other publicly readable code.

You can sue them for using your code if they break the licensing agreement. Contact EFF and they'll set you up with a lawyer.
It is called LICENSE.txt. License your code as GPL and then Copilot can't reproduce bigger parts of your code.

But as long as you give the public access to your code, they can study it and learn from it. Humans and machines.

No, the license that you apply is completely irrelevant, and there’s certainly nothing whatsoever special about the GPL. Copilot is completely depending on being effectively exempt from copyright; if that legal theory falls apart, the entire space (and a lot of other machine learning stuff) is utterly doomed. Trouble is, Copilot can’t tell whether it’s reproducing copyrightable chunks of your code, or indeed where what it produces came from, by the very nature of machine learning techniques.
They could easily tag the source with license info and take that information into account when feeding data in.
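
A minimal sketch of what that tagging could look like, assuming each file arrives with an SPDX-style license identifier attached. The `SourceFile` record, the `ALLOWED_LICENSES` set, and `select_training_files` are hypothetical names for illustration, not anything GitHub is known to use.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Hypothetical allow-list; which licenses qualify is a policy decision.
ALLOWED_LICENSES: Set[str] = {"MIT", "Apache-2.0", "BSD-3-Clause"}


@dataclass(frozen=True)
class SourceFile:
    repo: str
    path: str
    license_id: Optional[str]  # SPDX-style tag, None when unknown
    text: str


def select_training_files(
    files: Iterable[SourceFile],
    allowed: Set[str] = ALLOWED_LICENSES,
) -> List[SourceFile]:
    """Keep only files whose license tag is on the allow-list.

    Files with an unknown license (None) are dropped as well.
    """
    return [f for f in files if f.license_id in allowed]


if __name__ == "__main__":
    corpus = [
        SourceFile("a/x", "m.py", "MIT", "print('hi')"),
        SourceFile("b/y", "n.py", "GPL-3.0-only", "print('hello')"),
        SourceFile("c/z", "o.py", None, "print('hey')"),
    ]
    for f in select_training_files(corpus):
        print(f.repo, f.license_id)
```

This only filters what goes into the training set; it does nothing to trace a given output back to the files that influenced it.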
That’s not how learning, human or machine, works. Learning is about collecting all kinds of stuff from diverse sources into a great melting-pot, so that you can form something new out of it—but you can’t generally identify where everything comes from. Individual recognisable tricks perhaps, but if you want to say “this code was inspired by X, Y and Z”, well, that inspiration is typically everything, the entire corpus.
It could, actually, if it were augmented with the ability to do so – but that would be a bit more expensive.
I don't think that the GPL gives much more protection than any other FOSS license, here, in practice.

If Copilot were to reproduce a larger part of, say, an MIT-licensed codebase or almost any other permissive licence, then they should legally provide attribution. I'm pretty sure that they don't even have an option to provide such specific attribution, which means that either they believe that the code copied from any one source is below the relevant threshold or they're just ignoring copyright.

I would assume GitHub could supersede your license by putting its own claim to your code in the TOS. I doubt they have done that, but just pointing it out.
I don’t think that’s possible, as long as you don’t actively accept that. Nobody can claim your copyright without your approval.

It would also be the end of GitHub, as most users probably won’t accept such terms.

According to many people familiar with the legal aspect, training on code constitutes fair use, so it can't be prevented by any kind of license.
Training, exactly. But the trained person or AI is not allowed to reproduce your exact code. But Copilot seems to do that from time to time.
Sure they can, search engines produce copyrighted material all the time. The issue comes in when people think this somehow indemnifies them as users of Copilot - my guess is, it doesn't protect you any more than if you use a search engine to copy an entire codebase for your own purposes.
I don't disagree with users not accepting the terms, just pointing out that license text doesn't trump everything.
I'd love to see an ML-GPL which specifically deals with using licensed property as a training set.
Not possible. Such licenses are founded upon copyright doctrine, and copyright doesn’t protect against learning, natural or machine. As it stands (and this can certainly change), legal consensus in general (regardless of jurisdiction) is that if you publish your code where they can reach it, they can use it.
So would it (theoretically) be legal to train on the JS files services like gmail.com serve to the client? What about decompiled output of proprietary software like certain files in Windows and macOS?
Except for any laws or restrictions against decompiling, it would legally be no different than the GPL case. Although personally I think since Copilot is capable of redistributing the code, the question of whether the GPL permits the specific usage is still unclear.
I would expect so, though given the limitations of decompilation (in the absence of debug info) I don't know how useful it would be.
Or specifically no closed-source or closed-data tools. I wouldn't mind if a non-profit org wanted to help, but it's Microsoft—and they want to sell it back to us in the future.
:) I suppose you could always just add so much insecure code to your GitHub account that your expected value to Copilot is negative.

Although judging from the results of this test it kind of seems like for a lot of accounts that's already happened.

Plus, GH doesn't care how you licensed your code. It will learn from it and produce licensed code.