|
|
|
|
|
by uncircle
368 days ago
|
|
> Does this mean that eventually in a world where we all use this stuff, no new language/framework/library will ever be able to emerge? That's a very good question. Rephrased: as good training data will diminish exponentially with the Internet being inundated by LLM regurgitations, will "AI savvy" coders prefer old, boring languages and tech because there's more low-radiation training data from the pre-LLM era? The most popular language/framework combination in early 2020s is JavaScript/React. It'll be the new COBOL, but you won't need an expensive consultant to maintain in the 2100s because LLMs can do it for you. Corollary: to escape the AI craze, let's keep inventing new languages. Lisps with pervasive macro usage and custom DSLs will be safe until actual AGIs that can macroexpand better than you. |
|
I don't think the premise is accurate in this specific case.
First, if anything, training data for newer libs can only increase. Presumably code reaches github in a "at least it compiles" state. So you have lots of people fight the AIs and push code that at least compiles. You can then filter for the newer libs and train on that.
Second, pre-training is already mostly solved. The pudding seems to be now in post-training. And for coding a lot of post-training is done with RL / other unsupervised techniques. You get enough signals from using generate -> check loops that you can do that reliably.
The idea that "we're running out of data" is way too overblown IMO, especially considering the last ~6mo-1y advances we've seen so far. Keep in mind that the better your "generation" pipeline becomes, the better will later models be. And the current "agentic" loop based systems are getting pretty darn good.