HN Mirror


by Aurornis 387 days ago

If you found your exact code in another client’s hands, then it’s almost certainly because a person shared it between them. (EDIT: Or, if you’re claiming you used Copilot to generate a section of code for you, it shouldn’t be surprising when another team asking Copilot to solve the same problem gets similar output.)

For your story to be true, your GitHub Copilot LLM provider would have to use your code as training data. That’s technically possible, but it takes a long chain of events: you went out of your way to use a Bring Your Own Key API, then used a “free” public API that was free because it used prompts as training data, then used GitHub Copilot on that exact code, then that underlying public API data was used in a new training cycle, and then your other client happened to choose that exact same LLM for their code. On top of that, getting verbatim identical output from a single training fragment is extremely hard, let alone often enough to duplicate large sections of code verbatim with comment idiosyncrasies intact.

Standard GitHub Copilot or paid LLMs don’t even have a path where user data is incorporated into the training set. You have to go out of your way to use a “free” public API that is free only so it can collect training data. It’s a common misconception that merely using Claude or ChatGPT subscriptions will incorporate your prompts into the training data set, but companies have been very careful not to do this. I know many will doubt it and believe the companies are doing it anyway, but that would be a massive scandal in itself (one you’d have to believe nobody has blown the whistle on).

4 comments

throwaway314155 387 days ago

Indeed. In light of that, it seems this might (!) just be a real instance of "I'm obsolete because interns can get an LLM to output the same code I can."


kapitanjakc 384 days ago

Hmm, could very well be. But with comments intact?

Anyway, one thing I did not consider, which another comment points out, is that the original client could've provided the same code, since they are also its actual owners.


cmiles74 387 days ago

I believe the issue here is with tooling provided to the LLM. It looks like GitHub is providing tools to the LLM that give it the ability to search GitHub repositories. I wouldn't be shocked if this was a bug in some crappy MCP implementation someone whipped up under some serious time pressure.

I don't want to let Microsoft off the hook on this, but is this really that surprising?

Update: found the company's blog post on this issue.

https://invariantlabs.ai/blog/mcp-github-vulnerability


Shekelphile 387 days ago

No, what you're seeing here is that the underlying model was trained with private repo data from GitHub en masse, which would only have happened if MS had provided it in the first place.

MS also never respected this in the first place: exposing closed-source and dubiously licensed code used in training Copilot was one of the first things that happened when it was first made available.


kapitanjakc 384 days ago

Or, as the other comment points out, the original client might have used it on the code. So my conspiracy theory just came crashing down.
