But a priori you don't know if the code you find on GitHub is "good", plus it doesn't come with a handy explanation. The quality of the data is much, much worse.
Fair point, but large, popular, and well-maintained repos would likely be better to learn from than SO. Lots of Stack Overflow convos have moved to GitHub issues as well.
My point is that there won't even be any data to steal! The novel human-written and human-rated answers just won't exist anymore. Where will it get its answers on C++26 features from? Not the nonexistent Stack Overflow, that's for sure.
Ah, in the training data sense, yeah, that makes sense. My bet is that "code artisans" will see a revival in the 300k+ USD range, dropping into your codebase like a special forces team to unfuck the AI garbage all the prior "Seniors" implemented.
Why does any LLM need new information to do fundamentally the same thing?
And what makes the data outdated? New code? It can train on that. That, or there is simply nothing new to learn, just new ways to express the same thing.
> Why does any LLM need new information to do fundamentally the same thing?
What makes you think we will be doing fundamentally the same thing in the future? Languages grow and change, systems change, operating systems change, hardware and specs change.