| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by dcre 140 days ago
	This is dead wrong: essentially the entirety of the huge gains in coding performance in the past year have come from RL, not from new sources of training data. I echo the other commenters that proprietary code isn’t any better, plus it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there.

3 comments

elevation 140 days ago

> it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there

The quality of the existing code base makes a huge difference. On a recent greenfield effort, Claude emitted an MVP that matched the design semantics, but the code was not up to standards. For example, it repeatedly loaded a large file into memory in different areas where it was needed (rather than loading once and passing a reference.)

However, after an early refactor, the subsequently generated code vastly improved. It honors the testing and performance paradigms, and it's so clean there's nothing for the linter to do.

link

thesz 140 days ago

  > the huge gains in coding performance in the past year have come from RL, not from new sources of training data.

This one was on HN recently: https://spectrum.ieee.org/ai-coding-degrades

Author attributes past year's degradation of code generation by LLMs to excessive use of new source of training data, namely, users' code generation conversations.

link

dcre 140 days ago

Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.

And their “explanation” blaming the training data is just a guess on their part, one that I suspect is wrong. There is no argument given that that’s the actual cause of the observed phenomenon. It’s a just-so story: something that sounds like it could explain it but there’s no evidence it actually does.

My evidence is that RL is more relevant is that that’s what every single researcher and frontier lab employee I’ve heard speak about LLMs in the past year has said. I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.

link

thesz 140 days ago

  > Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.

I see "No True Scotsman" argument above.

  > My evidence is that RL is more relevant is that that’s what every single researcher and frontier lab employee I’ve heard speak about LLMs in the past year has said.

Reinforcement learning reinforces what is already in the LM, makes width of search path of possible correct answer narrower and wider search path in not-RL-tuned base models results in more correct answers [1].

[1] https://openreview.net/forum?id=4OsgYD7em5

  > I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.

The sources of training data already were the reasons for allegations, even leading to lawsuits. So I would suspect that no engineer from any LLM company would disclose anything on their sources of training data besides innocently sounding "synthetic data verified by ourselves."

From the days I have worked on blockchains, I am very skeptical about any company riding any hype. They face enormous competition and they will buy, borrow or steal their way to try to not go down even a little. So, until Anthropic opens the way they train their model so that we can reproduce their results, I will suspect they leaked test set into it and used users code generation conversation as new source of training data.

link

dcre 140 days ago

That is not what No True Scotsman is. I’m pointing out a bad argument with weak evidence.

link

thesz 140 days ago

  >>> It is a very contrived under-specified prompt.

No True Prompt can be such contrived and underspecified.

The article about degradation is a case study (single prompt), weakest of the studies in hierarchy of knowledge. Case studies are basis for further, more rigorous studies. And author took the time to test his assumptions and presented quite clear evidence that such degradation might be present and that we should investigate.

link

dcre 139 days ago

We have investigated. Millions of people are investigating all the time and finding that the coding capacity has improved dramatically over that time. A variety of very different benchmarks say the same. This one random guy’s stupid prompt says otherwise. Come on.

link

nextos 140 days ago

Progress with RL is very interesting, but it's still too inefficient. Current models do OK on simple boring linear code. But they output complete nonsense when presented with some compact but mildly complex code, e.g. a NumPyro model with some nesting and einsums.

For this reason, to be truly useful, model outputs need to be verifiable. Formal verification with languages like Dafny , F*, or Isabelle might offer some solutions [1]. Otherwise, a gigantic software artifact such as a compiler is going to have a critical correctness bugs with far-fetched consequences if deployed in production.

Right now, I think treating a LLM like something different than a very useful information retrieval system with excellent semantic capabilities is not something I am comfortable with.

[1] https://risemsr.github.io/blog/2026-02-04-nik-agentic-pop

link

dcre 140 days ago

Human-written compilers have bugs too! It takes decades of use to iron them out, and we’re introducing new ones all the time.

link