Hacker News
by Gentil 1461 days ago
> Any secrets it generated would have been already public and compromised.

Two things:

1. This is not some vigilante hacking group you are talking about. This is MS, a humongous corp. They know doing this is wrong. So it shouldn't matter. It is still illegal, and making a proprietary software product is exactly the reason why these things exist. Don't you think?

2. Already public unintentionally. AFAIK, any code that I produce is automatically copyrighted to me. This means if I write something in public and do not provide a license, IT IS LEGALLY under the copyright protection provided to me by my country. At least that is the case in the US and India, which are home to a huge portion of OSS. Do correct me if I am wrong. Putting it in public like that would be just plain stupid for sure. But legally it is still mine. Reproducing it and remixing my work would be illegal. It is just a PITA for me to have to prove that in court, is all.

> Whether doing so violates the licenses is probably up for courts to decide.

I have yet to see any attribution for ANY OSS code that it was trained on. The PII and secrets are enough to find out the license of a repo, which would make it easier to prove whether they violated it or not. Don't tell me all the OSS that Copilot has been trained on is only public domain stuff. Even the ISC license needs attribution.

> Human programmers can do this with search engines.

OK. CAN. And that is wrong too. So when they do, we should incriminate them as well if evidence is out there or if a matter such as this comes to light. How does that change anything I mentioned? Very curious.

1 comment

> So it shouldn't matter.

It definitely does matter that any secrets generated were already public:

* Emitting secrets from private repos would be a huge confidentiality issue (though really you shouldn't commit secrets to git at all), as it'd be taking something that's private and exploitable and making it public.

* Emitting secrets that are already public doesn't cause that confidentiality issue. Once a secret is out, it's out, and should be changed immediately. By the time it's in Copilot's training set, it'll have already been on search engines/archive sites/black-hat forums/etc.

Tangentially, GitHub also does some scanning to alert on accidentally committed secrets in repos: https://docs.github.com/en/code-security/secret-scanning/abo...

> 2. Already public unintentionally.

Right, but therefore already compromised and no longer confidential. Copilot isn't leaking any secrets; someone else did by making them public.

> AFAIK, any code that I produce is automatically copyrighted to me. This means if I write something in public and do not provide a license, IT IS LEGALLY under the copyright protection provided to me by my country. At least that is the case in the US and India, which are home to a huge portion of OSS.

Essentially correct, to my understanding. If you're making it public, you'll generally also give some hosting/publishing/distribution rights to the services involved - as specified by their T&C.

> Reproducing it and remixing my work would be illegal

The US has the concept of fair use, which provides exceptions for "transformative" purposes. For example: copying and downscaling your image to use as a thumbnail, caching the webpage your work is on, or creating a parody of your work.

Consider Google Books for example, where Google scanned millions of copyrighted books and made them searchable (showing snippets). This was ruled fair use due to being transformative.

The question would be whether code generated by Copilot falls under this. Ultimately it's up to the courts to decide, but I'd lean in favor of "yes".

> The PII and secrets are enough to find out the license of a repo, which would make it easier to prove whether they violated it or not. Don't tell me all the OSS that Copilot has been trained on is only public domain stuff. Even the ISC license needs attribution.

Fair use is about unlicensed usage, so if it's fair use then it doesn't need to abide by the terms of the licenses. Even if it's ruled not to be fair use, I think they could still train it on GitHub-hosted code due to the rights mentioned above, which you give them by agreeing to GitHub's T&C.

> How does that change anything I mentioned? Very curious.

Changes your claim of impossibility, so now it's just about whether there's a violation.