Hacker News
by chrischen 34 days ago
What exactly did they train? Copilot is powered by Claude, Gemini, or ChatGPT these days.

Did they train autocomplete? I mean, the code is open source, so anyone can scrape it and train on it too. I'm kind of glad they did train on it, because otherwise we'd still be stuck with Apple-level AI models right now.

The whole reason we have so many models, including open-weight models, that are all competitive with each other is that the data is free and anyone can train on it. If the goal was to monetize the source code, I guess the authors shouldn't have made it open source.


> "GitHub Copilot is powered by generative AI models developed by GitHub, OpenAI, and Microsoft. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub."

https://azure.microsoft.com/en-us/products/github/copilot#fa...

Yeah, have to agree here. GitHub Copilot itself doesn't have any first-party models; it uses the frontier labs' models. So they didn't "train" on public repos themselves, but they probably allowed (or didn't prevent) the frontier labs from pulling the repos along with the rest of the internet when creating their models.
Is y'all's collective memory so short? Just a few years ago, Copilot was autocomplete on steroids that was entirely first-party and trained by GH on users' code.
It used OpenAI's Codex model (see: https://en.wikipedia.org/wiki/GitHub_Copilot?wprov=sfla1)

OpenAI did train the model on GitHub repos. The next question is whether this was enabled by Microsoft's investment in / partnership with OpenAI. I suspect yes, but I haven't gone searching for this yet.

I guess it doesn't matter whether they allowed OpenAI to do it, because it seems other models were allowed to train on it too. I guess we should probably be giving kudos to GitHub and Microsoft for not trying to charge for access to this data.
It was even returning some code verbatim with the correct prompts.