by NitpickLawyer
368 days ago
> Rephrased: as good training data will diminish exponentially with the Internet being inundated by LLM regurgitations

I don't think the premise is accurate in this specific case.

First, if anything, training data for newer libs can only increase. Presumably code reaches GitHub in an "at least it compiles" state. So you have lots of people fighting the AIs and pushing code that at least compiles. You can then filter for the newer libs and train on that.

Second, pre-training is already mostly solved. The pudding now seems to be in post-training. And for coding, a lot of post-training is done with RL / other unsupervised techniques. You get enough signal from generate -> check loops that you can do that reliably.

The idea that "we're running out of data" is way overblown IMO, especially considering the advances we've seen over the last ~6mo-1y. Keep in mind that the better your "generation" pipeline becomes, the better later models will be. And the current "agentic" loop-based systems are getting pretty darn good.
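To make the generate -> check idea concrete, here is a rough sketch of what such a loop could look like. Everything in it is an assumption for illustration: `generate_check_loop` and `compiles` are hypothetical names, the "check" is just Python's built-in `compile()` standing in for a real compiler or test suite, and the generator is whatever callable you pass in (in practice, a model). Real post-training pipelines are far more involved than this.

```python
from typing import Callable, List, Tuple


def compiles(source: str) -> bool:
    """Check step: does the candidate at least parse as Python?

    Stand-in for a real verifier (compiler, unit tests, linter).
    """
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def generate_check_loop(
    generate: Callable[[str], str],
    check: Callable[[str], bool],
    prompts: List[str],
    attempts: int = 4,
) -> List[Tuple[str, str]]:
    """Keep the first candidate per prompt that passes the check.

    The kept (prompt, candidate) pairs are the filtered data that
    could feed a later training round; failures are simply dropped.
    """
    kept: List[Tuple[str, str]] = []
    for prompt in prompts:
        for _ in range(attempts):
            candidate = generate(prompt)  # hypothetical model call
            if check(candidate):
                kept.append((prompt, candidate))
                break
    return kept
```

The point is only that the check supplies the signal: nothing here needs human-written examples, just a generator and something that can say pass or fail.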
How?
Presumably, in the "every coder is using AI assistants" future, there will be an incredible amount of friction in getting people to adopt languages that AI assistants don't know anything about.
So how does the training data for a new language get made, if no programmers are using the language, because the AI tools that all programmers rely on aren't trained on the language?
The snake eating its own tail.