by khat 46 days ago
The problem with AI-generated code is that the code the model was trained on comes almost exclusively from public repositories. And there are a lot of repositories that are absolute dog $h!t or outdated. Crap in equals crap out.
3 comments

That isn't how LLM training has worked for some time. There's a reason the LLM boom didn't take off until training was separated into pretraining (training on all data) and posttraining (RLHF to make the output actually aligned).

It's also why model collapse is not a thing despite everyone wanting it to be.

OpenAI and Anthropic spent almost all of 2025 running RL to improve the coding abilities of their models - which involves running thousands of VMs that execute generated code to see if it works.

That's why the code you get from the post-November models is so much better than older models.
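
To make the "execute generated code to see if it works" idea concrete, here is a minimal sketch of an execution-based reward. It is an illustration only and not a description of any lab's actual pipeline: the function name `execution_reward`, the pass/fail scoring, and the timeout are all assumptions, and a plain subprocess stands in for the sandboxed VMs mentioned above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_reward(candidate_source: str, test_source: str, timeout_s: float = 5.0) -> float:
    """Hypothetical reward: 1.0 if the candidate passes its tests, else 0.0.

    The candidate code and the tests are written to one script and run in a
    fresh interpreter. A real setup would isolate this far more strictly.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate_with_tests.py"
        script.write_text(candidate_source + "\n\n" + test_source + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Code that hangs is scored the same as code that fails.
            return 0.0
    # A non-zero exit code means an assertion failed or the code crashed.
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5"
    print(execution_reward("def add(a, b):\n    return a + b", tests))
    print(execution_reward("def add(a, b):\n    return a - b", tests))
```

The point of the sketch is that the training signal comes from whether the code runs and passes checks, not from how closely it resembles code in public repositories.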

Ha, I had this thought a few months ago. It made me wonder how a model trained on just John Carmack's code would fare.
Carmack is a smart guy, and there's no question that he's amazing at optimization, but his code is pretty messy, especially early versions.

In the Doom engine, for example, he hard-coded lots of things directly in the C engine code that really should be part of the regular game code.