by martinwoodward 91 days ago
No we won’t. Details here https://github.blog/news-insights/company-news/updates-to-gi...

For users of Free, Pro and Pro+ Copilot, if you don’t opt out then we will start collecting usage data of Copilot for use in model training.

If you are a subscriber for Business or Pro we do not train on usage.

The blog post covers more details but we do not train on private repo data at rest, just interaction data with Copilot. If you don’t use Copilot this will not affect you. However you can still opt out now if you wish and that preference will be retained if you decide to start using Copilot in the future.

Hope that helps.

35 comments

> https://github.blog/news-insights/company-news/updates-to-gi...

> Should you decide to participate in this program, the interaction data we may collect and leverage includes:

> - Outputs accepted or modified by you

> - Inputs sent to GitHub Copilot, including code snippets shown to the model

> - Code context surrounding your cursor position

> - Comments and documentation you write

> - File names, repository structure, and navigation patterns

> - Interactions with Copilot features (chat, inline suggestions, etc.)

> - Your feedback on suggestions (thumbs up/down ratings)

"should you decide to participate.."??? You didn't ask if I wanted to participate. You asked if I didn't.

I didn't get to decide to participate. I had to decide not to. You made me do work to prevent my privacy from being violated.

Do you use copilot?
First response: It doesn't matter if I use copilot right now. It matters if I will ever use copilot in the future. Opting-out is future-focused. What if I said "no, I don't use copilot, so I don't need to opt out", then a year from now start using copilot, completely forgetting about this whole debacle? That's the evil of opt-out. My inaction only benefits them, never me.

Second response: Maybe? I press the little button to auto-generate commit titles and messages that showed up in my Github Desktop. Does that count?

I'm asking sincerely. I don't "use Copilot" as in using it in VS Code or while writing code, so I'm honestly not sure if I am.

Do we get a choice? I did not ever explicitly enable it yet GitHub's web UI by default uses copilot to autofill my web-based edit commit messages. It also shows up on the home screen by default now.

I'm pretty sure if you use the site you're using GitHub Copilot in some way, so your question becomes irrelevant.

Do you think a single person works on a repo?
It's unnecessarily splitting hairs.

> interaction data—specifically inputs, outputs, code snippets, and associated context [...] will be used to train and improve our AI models

So using Copilot in a private repo, where lots of that repo will be used as context for Copilot, means GitHub will be using your private repo as training data when they were not before.

No it isn't. Most people don't use Copilot, so this term change won't affect most people. You can reasonably be unhappy about it anyways (or unreasonably still be using Copilot in 2026), but it's still ultra-useful information for them to add to the discussion.
Next step they'll rebrand search as "Copilot Search" or auto enable pull-request AI reviews (unless you hear about it and turn each off) and we'll all be "users".

Boiling the frog with a Venn diagram.

Copilot, or "chat with Copilot" is a button that is available on every page right next to the search bar.

I don't have to be a Copilot user to click on it.

This change is malicious, and it doesn't only affect Copilot users. It affects everyone on the platform!

Again, this collects usage data. If you click the button by accident and don’t interact, they get no data.
So? This feature is available to everyone and you have zero idea how many people actually use it.

If I go to one of your GPL projects and I ask a simple question to find out what this project is about, you will be perfectly "ok" that this interaction (that includes most of the code that is required to answer my dumb question) will be used for training?

This is not ok.

Nobody in this subthread is saying if it's OK or not. We're just saying that it's very useful to know that this is what they're specifically collecting. Jiminy.
It's automatically enabled. For example, the other day I made a commit directly on GitHub and an AI-generated commit message popped up; it had to read the code to work.
> Most people don't use Copilot

So why do any of this at all? You're putting a large part of your customer base on edge in order to improve a service that "most people don't use." The erosion of trust this brings doesn't seem like a worthwhile or prudent sacrifice.

You're asking me to explain Microsoft AI strategy? Your guess is as good as mine.
I don't use copilot, but somehow was subscribed... I probably clicked something long ago and it just remained active.
They "gift you" a free standard plan if you have above a certain (non-transparent) level of stars, I don't think you can even disable your "subscription" if you get it for free.
They're only training on interactions with Copilot, not with the full contents of repos that happen to be subscribed to Copilot.
Make it opt-in then.
Isn't this pretty standard, using your interaction data for training and making it opt-out? Claude Code, Codex, Antigravity etc. all do the same. Private repo doesn't make a difference as they have a local copy to work from.
The initial title and your reply are both too broad to be fully accurate. By April 24th Github will train on private repos (assuming a flag isn't set) but this change is limited to just non-Business/Pro users. So a number of private repos will be affected but it won't automatically affect all private repos (so my panic check on our corporate account wasn't necessary yet).

I am not certain if you're a spokesperson for github - but it's good to be careful in your language. Instead of "No we won't" a lead like "That isn't entirely accurate" would be more suitable. In the end both the original post title and your reply have ended up being misleading.

> By April 24th Github will train on private repos

This statement itself is misleading. Also, GitHub probably should have seen this coming.

They are not doing what I initially thought, which is slurping up your private repo, wholesale, into its training set. You don't have to opt out of anything to prevent that.

They are slurping any context and input containing code from your private repo which is provided to them as part of using Copilot.

So, in addition to the opt-out setting, there is an even easier way to avoid providing them your private repository data to train AI models, and that's by continuing to not use Copilot.

That's still pretty bad. It's no longer private if all your code goes into an LLM training set and can resurface to everyone publicly.

Why would I ever use copilot on any code I'd want to be kept private? Labeling it a private repo and having a tiny clause in the TOS saying we can take your code and show it to everybody is just an outright lie.

I mean, you shouldn't send data to any SaaS LLM for code you want to be private, unless you have had them sign some sort of contract saying they will not train on your use. In fact, it is probably never a good idea to send anything you want to be private off premises unencrypted.
In the EU, opt-out is not a legally valid way to obtain the necessary consent. How do you plan to handle this?
probably by paying the fine and doing it anyway
s/fine/lawyers/
For personal data. I don't believe you can reasonably claim code is personal data any more than a hammer is your personal data.
Every Git commit is likely to contain personal data, in the form of the author’s name and email address usually present in a commit’s metadata. Furthermore, unless GitHub is prohibiting users from submitting personal data via their ToS (which, given the above, would be impractical), the only thing that matters is whether the data in fact contains personal data or not. GitHub cannot just assume that it doesn’t. And processing that data for new purposes requires user consent.
By that logic, you can't use any user input to train an LLM, because what if they decide to write their own name.
Indeed, you can’t unless you have appropriate consent. Which isn’t difficult to obtain if you have clearly defined purposes, but you have to do it.
Since commits aren't code, that's no problem.

The idea that because any piece of code could possibly contain some personal data -- while 99.99% of it doesn't -- that therefore the entirety is PD is not supported by the GDPR. You could as well say any text field anywhere can hypothetically have someone type their name and is thus personal data as well.

The current change applies to all input and output from and to Copilot. This can be used to create profiles about personal preferences, for example.

Personal data is about identifying a person and relating information to that person. A name in an unrelated text field isn’t personal data if you can’t tell the relation between the name and the person who input it, or any surrounding data. The contents of a repository, however, and the interaction with Copilot, can very well help identifying the account holder and their personal data. For example, I might be processing personal health data identifiable as such in a private repository with the help of Copilot.

> This can be used to create profiles about personal preferences

And since it's not, so what?

> I might be processing personal health data identifiable as such in a private repository with the help of Copilot.

That remains nonsense. The fact that you could put PD in a place not intended to hold PD does not magically transform entire datasets into PD because 1 record may contain it. This is covered in Article 24 (risk-based), and multiple EDPB discussions of proportionate measures. There is zero requirement to guarantee anything collected for a different purpose is not misused by the user, assuming you're not encouraging that misuse.

Code often contains personal data. Here are over 400 files on GitHub with email addresses:

https://grep.app/search?regexp=true&q=%5Ba-z%5D%7B8%2C%7D%5C...

For example, license files often contain names and many package managers require a contact person.

When this goes to court, GitHub will probably make the excuse that they somehow did not know that people upload personal data, but the fact that this happens so often that they had to make a secret scanner to stop people from uploading their private keys will prove them as liars.

Hey Martin, can you please work with Product to significantly clarify what is meant by the following language in the settings? Because right now it's nearly impossible for a layperson (or even an average programmer) to understand what this means:

""" Allow GitHub to use my data for AI model training

Allow GitHub to collect and use my Inputs, Outputs, and associated context to train and improve AI models. Read more in the Privacy Statement. """

If the reality is less scary than how it sounds, then the wording needs to be less scary-sounding. It may be that GitHub isn't training models on private repos, but the language certainly suggests that it is. The feedback we're seeing in this post is proof enough of that.

Finally, I read the Privacy Statement, and it's unclear what the applicable language is. "Inputs," "Outputs," and "Associated Context" are terms of art that have no matching definitions in the Statement. (The terms "Outputs" and "Associated Context" don't even appear in the Statement at all. Not even "train.") As an attorney I find this completely baffling.

Yes, you will. This is what the setting says on my account when I clicked the link:

> model training

> Allow GitHub to collect and use my Inputs, Outputs, and associated context to train and improve AI models. Read more in the Privacy Statement

Are you seriously trying to claim that the code isn't input, output, or associated context of Copilot operating on a private repo? What term do you think better applies to the code that's being read as input, used as context, and potentially produced as output?

I don't like that they are training on any interactions with Copilot by default but training on something that you've put through Copilot yourself is much different than them just shoving all the private repos currently on Github into the training data.
If you are not willing to migrate out of GitHub, what you can do is to avoid using Copilot on your private repository.
I don't use Copilot, and I don't have anything I particularly care about in private repos on my account on Github. My reaction here is entirely based on principles, not how I'm going to be personally affected.
If Copilot later adds a feature like "Scan your repo for vulnerabilities using Copilot <opt-out>", then that would both fit your criteria, and the baiting outrage of the original poster, in one swoop! Of course, Microsoft would _never_ do that, right?
> If you don’t use Copilot this will not affect you.

How does this work for a private repository with access granted to additional contributors? Which setting is consulted then?

Nice try. If you're training on "inputs" to Copilot then you are training on the private repos.

This suspect denial is why I will get my clients moved off of github.

Back in my day someone would post a HN article to the internal slack in order to sway conversation in their favor. Glad to see it's still happening! :D
Yes you do? If a user uses any form of Copilot in one of their repos (except, of course, Enterprise), it says so right in the blog post. These aktshually corporate technicality defense posts aren’t helping, they just end up making you personally look a bit fishy.
Right, but it shouldn't be opt-out only to begin with. It's a dishonest pattern that relies on people not noticing. Honest use of data is a "Caesar's wife must be above suspicion" moment for me -- if this is how you're acting when engaging with customers explicitly, I don't trust you to resist the temptation to tap into my data privately. AI companies already have trained their models illegally against the intellectual property of all of humanity with little consent along the way.

Honestly, if you work at GitHub, maybe you should focus on your uptime -- it's awful.

I think the problem is more with using PRIVATE repos. My letters are also private and I would be pretty pissed if the mail carrier was reading them. Why does GitHub think it has the right to do this?
Appreciate the clarification. But, it's still not great.

To the PM behind this - developers are sensitive to this kind of thing. Just make it opt-in instead?

Say someone has a very sensitive secret (say, a Bitcoin private key) in their free private Github repo, and uses Copilot on that repo and touches the secret with it. Would you be willing to assure here that toggling that setting would not affect the likelihood of that secret leaking, and that that likelihood is also unaffected by whether the account is Business or Free?
Thanks for confirming you train on our data
Question. How does it work if I own a repository (opted out, don't use Copilot) and I give access to someone else (that user is opted in and uses Copilot)? Do you train on his submissions of my code? How can you know that he has the right to share the code with you for training?
How do you handle accounts that have copilot managed by an organisation? I've seen several cases where people cannot opt out their account because of the org connection (the option just isn't there in the settings). What happens to their account the moment they leave that org?
Sorry, that doesn't help at all, but you can still be useful: can you please tell us how many private repos belonging to "users of Free, Pro and Pro+ Copilot" who have used Copilot in the last 90 days exist in the github database?

Because microsuck is about to violate the law that many times

I'm in the process of moving all of my repos off of github and deleting that account.

Hope that helps.

So you will train on data collected from free users working on GPL and copyrighted projects?
And on users that don’t even use github, other than the required account to use CoPilot in Visual Studio.
Exactly.

This affects anyone using VS Code or Copilot with proprietary data, including all the users automating workflows through the Copilot SDK and the like. A perfect storm.

Did anyone from GitHub's legal team actually authorise this, or did they use Copilot to sign off on it?

Under GDPR, opt-out is not considered informed consent, and repositories can contain personally identifiable information, which falls under GDPR. Do you think differently, or do you think ignoring the law will be worth it?
Thanks for the clarification. The OP here made me think I missed something in both the blog post about the change and in the available settings.
This is a distinction without a difference, according to the text of that enable/disable dialog,

> Allow GitHub to use my data for AI model training: Allow GitHub to collect and use my Inputs, Outputs, and associated context to train and improve AI models. Read more in the Privacy Statement.

“Associated Context” is the repo. If I use copilot, I’m giving it access to my repo.

I don’t know all the ways copilot can be triggered, and I’m not certain that I could stop it from being triggered, given Microsoft’s past behaviors in slapping Copilot on everything that exists.

Can't you just make it opt-in?

No? Because no one would opt-in, you say?

Wow. It's almost like this is a user-hostile feature that breaks the implicit promise behind a "private" repo.

I think you're well aware that people aren't upset at the distinction between training on Copilot data versus training on private repo data (at rest). People are upset because GH is using an opt-out model. Your response is disingenuous not to address this, and the "hope this helps" comes across as condescending (not sure if that was your intention.)
As others have pointed out, this is somewhat dishonest. Which is depressing, if you represent GitHub.
>Hope that helps

Honestly, what the fuck? This change was already pretty bad but this being the apparent corporate response is insane.

Done with Github and Microsoft after this. Just disgusting how little you care for users, ethics, or morals.

Why not get user consent first?
I am aware of CUI data hosted on github by corporate entities. You’re saying you’ll essentially violate the entire point of CUI?

That’s fucking terrifying.

Defaulting users to opted-in is a malicious move, no matter how you present things.
"hope that helps"

Why the smug sarcastic attitude? nah, fuck github i'm out.

tl;dr: installed gitlab.

I'm not bidding against you to not train on my data.

“Opt-out” is an egregiously toxic and unethical approach to consent and should be illegal everywhere that it isn’t already.

I didn’t think Github had much of a brand left to damage, but here we are.