| HN Mirror


by zhangjunphy 379 days ago

I also hope we have something like this. But sadly, this is not going to work. The reason is this line from the article, which is so much harder than it looks:

> and a critic model filters the results for genuinely valuable ideas.

In fact, people have tried this idea. And if you use an LLM or anything similar as the critic, the performance of the model actually degrades in this process, as the LLM tries too hard to satisfy the critic, and the critic itself is far from a good reasoner.

So the reason that we don't hear too much about this idea is not that nobody tried it, but that they tried, it didn't work, and people are reluctant to publish about something which does not work.

4 comments

imiric 379 days ago

Exactly.

This not only affects a potential critic model, but the entire concept of a "reasoning" model is based on the same flawed idea—that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions or doubt, the final output can only be an amalgamation of that. I've seen the "thinking" output arrive at a correct solution in the first few steps, but then talk itself out of it later. Or go into logical loops, without actually arriving at anything.

The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent either, but that's a separate discussion.

yorwba 379 days ago

Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output is not an idea that reasoning models are based on, but an outcome of the training process. Because empirically it does produce answers that pass the verifier more often if it generates the intermediate steps first.

That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
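To make the "output of the same model, filtered through a verifier" step concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in for a base model that sometimes gets an addition wrong, and `verifier` is a trivial external check. It is not any lab's actual training pipeline.

```python
import random

def toy_model(question, rng):
    """Hypothetical stand-in for a base model: returns (trace, answer).
    It adds two numbers but is wrong roughly 40% of the time."""
    a, b = question
    answer = a + b
    if rng.random() < 0.4:  # simulate a mistake
        answer += rng.choice([-1, 1])
    trace = f"{a} + {b} = {answer}"  # the "intermediate steps"
    return trace, answer

def verifier(question, answer):
    """External check that does not depend on the model."""
    a, b = question
    return answer == a + b

def build_training_set(questions, samples_per_question=8, seed=0):
    """Sample several outputs per question and keep only verified ones."""
    rng = random.Random(seed)
    kept = []
    for q in questions:
        for _ in range(samples_per_question):
            trace, answer = toy_model(q, rng)
            if verifier(q, answer):
                kept.append((q, trace, answer))
    return kept

questions = [(1, 2), (10, 5), (7, 7)]
training_set = build_training_set(questions)
```

The kept traces would then be used as fine-tuning data; the point is only that the filter, not the model, decides what counts as a good example.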

imiric 379 days ago

Thanks. I trust that you're more familiar with the internals than myself, so I stand corrected.

I'm only speaking from personal usage experience, and don't trust benchmarks since they are often gamed, but if this process produces objectively better results that aren't achieved by scaling up alone, then that's a good thing.

danenania 379 days ago

> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.

Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.

imiric 379 days ago

I'm curious: can you link to any tests that prove this?

I don't trust most benchmarks, but if this can be easily confirmed by an apples-to-apples comparison, then I would be inclined to believe it.

danenania 379 days ago

Check out the DeepSeek paper.

Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).

I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.

imiric 379 days ago

Huh, I wasn't aware that reasoning could be toggled. I use the OpenRouter API, and just saw that this is supported both via their web UI and API. I'm used to Sonnet 3.5 and 4 without reasoning, and their performance is roughly the same IME.

I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.

But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.

jacobr1 379 days ago

It depends on the problem domain you have and the way you prompt things. Basically, reasoning is better in cases where using the same model to critique itself over multiple turns would be better.

With code, for example, a single shot without reasoning might have hallucinated a package or not conformed to the rest of the project style. Then you ask the LLM to check. Then you ask it to revise itself to fix the issue. If the base model can do that, then turning on reasoning basically allows it to self-check for the self-correctable features.

When generating content, you can ask it to consider or produce intermediate deliverables like summaries of input documents that it then synthesizes into the whole. With reasoning on, it can do the intermediate steps and then use that.

The main advantage is that the system is autonomously figuring out a bunch of intermediate steps and working through it. Again no better than it probably could do with some guidance on multiple interactions - but that itself is a big productivity benefit. The second gen (or really 1.5 gen) reasoning models also seem to have been trained on enough reasoning traces that they are starting to know about additional factors to consider so the reasoning loop is tighter.

rishabhjain1198 378 days ago

Reasoning cannot actually be toggled. LLM companies serve completely different models based on whether you have reasoning enabled or disabled for "Opus 4".

amelius 379 days ago

But what if the critic is just hard reality? If you ask an LLM to write a computer program, instead of criticizing it, you can run it and test it. If you ask an LLM to prove a theorem, let it write the proof in a formal logic language so it can be verified. Etcetera.
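A rough sketch of what "run it and test it" could look like as a filter. The candidate strings below are hand-written stand-ins for LLM output, and the function name `solve` is an assumption made for this example. A real setup would run candidates in a sandbox rather than calling `exec` directly.

```python
def passes_tests(source, tests, func_name="solve"):
    """Return True only if the candidate defines func_name and passes every test."""
    namespace = {}
    try:
        # WARNING: exec on untrusted code is unsafe; use a sandbox in practice.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions and runtime errors all count as failures.
        return False

# Hypothetical candidates standing in for generated programs.
candidates = [
    "def solve(a, b):\n    return a - b\n",  # wrong logic
    "def solve(a, b)\n    return a + b\n",   # syntax error (missing colon)
    "def solve(a, b):\n    return a + b\n",  # correct
]

# Each test is (arguments, expected result).
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

accepted = [src for src in candidates if passes_tests(src, tests)]
```

Here the "critic" has no opinions of its own; a candidate survives only if it actually runs and produces the expected results.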

Yizahi 379 days ago

Generated code only works because the "test" part (compile/validate/analyze etc.) is completely external and was written before any mass-market LLMs. There is no such external validator for new theorems, books, pictures, text guides etc. You can't just run hard_reality.exe on a generated poem or a scientific paper to deem it "correct". It is only possible with programming languages, and even then not always.

amelius 379 days ago

Science is falsifiable by definition, and writing poems/books is not the kind of problem of interest here.

> There is no such external validator for new theorems

There are formal logic languages that will allow you to do this.

Yizahi 379 days ago

Your proposed approach to science would result in an extremely tiny subset of math, probably just theorems being proven by automation. And it is questionable whether those theorems would even be useful. A good mathematician with CS experience can probably write a generator of new useless theorems, something along the lines of "is every sequential cube plus the square of a number divisible by the root of the seventh smallest prime multiplied by log n of that number plus blabla...". One can generate such theorems and formally prove or disprove them, yes.

On the other hand, any novel science usually requires deep and wide exploratory research, often involving hard or flawed experimentation or observation. One can train an LLM on a PhD curriculum in astrophysics, then provide that LLM with an API to some new observatory and instruct it to "go prove cosmological constant". And it will do so, but the result will be generated garbage because there is no formal way to prove such results. There is no formal way to prove why pharaohs decided to stop building pyramids, despite there being some decent theories. This is science too, you know. You can't formally prove that some gene sequence is responsible for trait X etc.

I would say a majority of science is not formally provable.

And lastly, you dismiss books/texts, but that is a huge chunk of the intellectual and creative work of humans. Say you are an engineer and you have a CAD model with a list of parts and parameters for a rocket, for example. Now you need to write a guide for it. An LLM can do that; it can generate guide-looking output. The issue is that there is no way to automatically proofread it or find issues in it. And there are lots of items like that.

jacobr1 379 days ago

> You can't formally prove that some gene sequence is responsible for trait X etc.

Maybe not formally in some kind of mathematical sense. But you certainly could have simulation models of protein synthesis, and maybe even higher-order simulations of tissues and organs. You could also let the AI scientist verify the experimental hypothesis by giving it access to robotic lab processes. In fact it seems we are going down both fronts right now.

Yizahi 379 days ago

Nobody argues that LLMs aren't useful for some bulk processing of a billion datapoints or looking for obscure correlations in unedited data. But the premise of Gwern's article is that to be considered thinking, an LLM must initiate such a search on its own and arrive at a novel conclusion on its own.

Basically if:

A) A scientist has an idea > triggers an LLM program to sift through a ton of data > the LLM prints out correlation results > the scientist reads them and proves/disproves the idea. In this case, while the LLM did the bulk of the work, it did not arrive at a breakthrough on its own.

B) The LLM is idling > then the LLM triggers some API to get some specific set of data > the LLM correlates results > the LLM prints out a complete hypothesis with proof (or disproves it). In this case we can say that the LLM made a breakthrough.

amelius 379 days ago

I think the problem here is that you assume the LLM has to operate isolated from the world, i.e. without interaction. If you put a human scientist in isolation, then you cannot have high expectations either.

Yizahi 379 days ago

I don't assume that the LLM would be isolated; I assume that the LLM would be incapable of interacting in any meaningful way on its own (i.e. not triggered by direct input from a programmer).

yunohn 379 days ago

IME, on a daily basis, Claude Code (supposed SoTA agent) constantly disables and bypasses tests and checks on my codebase - despite following clear prompting guidelines and all the /woo/ like ultrathink etc.

zhangjunphy 379 days ago

I think if we can have a good enough simulation of reality, and a fast one, something like an accelerable Minecraft with real-world physics, then this idea might actually work. But the hard reality we can currently generate efficiently and feed into LLMs usually has a narrow scope. It feels like teaching only textbook math to a kid for several years but nothing else. The LLM mostly overoptimizes in these very specific fields, but the overall performance might even be worse.

dpoloncsak 379 days ago

It's gotta be G-Mod

leetbulb 379 days ago

There will never be a computer powerful enough to simulate that many paperclips and explosive barrels.

jerf 379 days ago

Those things are being done. Program testing is now off-the-shelf tech, and as for math proofs, see: https://www.geeky-gadgets.com/google-deepmind-alphaproof/

imtringued 379 days ago

That didn't stop actor-critic from becoming one of the most popular deep RL methods.

zhangjunphy 379 days ago

True, and the successful ones usually require an external source of information. For AlphaGo, it is the simple algorithm which decides who is the winner of a game of Go. For GANs, it is the images labeled by humans. In these scenarios, the critic is the medium which transforms external information into gradients which optimize the actor, but not the direct source of that information.
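A toy illustration of that division of labor, as a sketch under stated assumptions: a two-armed bandit where the reward comes from `environment_reward` (the external source of information), the critic is only a running estimate of expected reward, and the actor is a softmax over two preferences. All names and hyperparameters here are made up for the example; this is not how AlphaGo or any GAN is implemented.

```python
import math
import random

def softmax(prefs):
    """Turn preferences into action probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def environment_reward(arm):
    """External source of truth: arm 1 pays 1.0, arm 0 pays 0.0."""
    return 1.0 if arm == 1 else 0.0

def train(steps=2000, actor_lr=0.1, critic_lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]  # actor parameters
    value = 0.0         # critic: running estimate of expected reward
    for _ in range(steps):
        probs = softmax(prefs)
        arm = 0 if rng.random() < probs[0] else 1
        reward = environment_reward(arm)  # information enters here
        advantage = reward - value        # critic converts it into a learning signal
        value += critic_lr * advantage    # critic update
        for a in range(2):
            # Gradient of log softmax probability of the chosen arm w.r.t. prefs[a].
            grad = (1.0 - probs[a]) if a == arm else -probs[a]
            prefs[a] += actor_lr * advantage * grad  # actor update
    return softmax(prefs), value

final_probs, final_value = train()
```

If `environment_reward` were replaced by the critic's own guess, no new information would enter the loop, which is the failure mode described for an LLM-as-critic setup.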

jsbg 379 days ago

> the LLM tries too hard to satisfy the critic

The LLM doesn't have to know about the critic though. It can just output things and the critic is a second process that filters the output for the end user.