by imiric 332 days ago
I'm curious: can you link to any tests that prove this?

I don't trust most benchmarks, but if this can be easily confirmed by an apples-to-apples comparison, then I would be inclined to believe it.


Check out the DeepSeek paper.

Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).

I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.

Huh, I wasn't aware that reasoning could be toggled. I use the OpenRouter API, and just saw that this is supported both via their web UI and API. I'm used to Sonnet 3.5 and 4 without reasoning, and their performance is roughly the same IME.

I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.

But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.
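
For what it's worth, here is a rough sketch of how that same-model comparison could be scripted. Everything here is an assumption for illustration: `ask(prompt, reasoning)` is a hypothetical wrapper around whatever API you use, and each task supplies its own pass/fail checker. Repeating each task several times is meant to average out the run-to-run variance mentioned above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical wrapper: ask(prompt, reasoning) -> model output text.
Ask = Callable[[str, bool], str]
# A task is a prompt plus a checker that decides whether an output passes.
Task = Tuple[str, Callable[[str], bool]]


def compare_reasoning(ask: Ask, tasks: List[Task], trials: int = 5) -> Dict[bool, float]:
    """Return the pass rate with reasoning off (False) and on (True)."""
    total = len(tasks) * trials
    if total == 0:
        raise ValueError("need at least one task and one trial")
    passes = {False: 0, True: 0}
    for prompt, check in tasks:
        for _ in range(trials):
            # Same model, same prompt; only the reasoning flag changes.
            for reasoning in (False, True):
                if check(ask(prompt, reasoning)):
                    passes[reasoning] += 1
    return {mode: count / total for mode, count in passes.items()}
```

You would pass in your own `ask` that calls the provider with reasoning enabled or disabled, and keep every other parameter fixed between the two modes.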

It depends on your problem domain and how you prompt. Basically, reasoning helps in the cases where having the same model critique itself over multiple turns would also help.

With code, for example, a single shot without reasoning might hallucinate a package or fail to conform to the rest of the project's style. You then ask the LLM to check its output, and then ask it to revise itself to fix the issue. If the base model can do that, then turning on reasoning basically allows it to self-check for those self-correctable problems.
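
A minimal sketch of that manual draft, check, revise loop, assuming a hypothetical `llm(prompt)` callable that returns text. The prompts and the "reply OK" convention are made up for illustration, not taken from any particular tool.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, text out


def draft_check_revise(llm: LLM, task: str, max_rounds: int = 2) -> str:
    """Draft an answer, then let the same model critique and revise it."""
    answer = llm(task)
    for _ in range(max_rounds):
        critique = llm(
            "Review this answer for hallucinated packages or style violations. "
            "Reply with exactly OK if there are none.\n\n"
            f"Task: {task}\n\nAnswer:\n{answer}"
        )
        if critique.strip() == "OK":
            break  # the model found nothing left to fix
        answer = llm(
            "Revise the answer to fix the issues listed.\n\n"
            f"Task: {task}\n\nAnswer:\n{answer}\n\nIssues:\n{critique}"
        )
    return answer
```

With reasoning on, the idea is that the model does roughly this inside a single call instead of needing you to drive each turn.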

When generating content, you can ask it to consider or produce intermediate deliverables like summaries of input documents that it then synthesizes into the whole. With reasoning on, it can do the intermediate steps and then use that.

The main advantage is that the system autonomously figures out a bunch of intermediate steps and works through them. Again, that's probably no better than it could do with some guidance over multiple interactions, but that in itself is a big productivity benefit. The second-gen (or really 1.5-gen) reasoning models also seem to have been trained on enough reasoning traces that they are starting to know about additional factors to consider, so the reasoning loop is tighter.

Reasoning cannot actually be toggled. LLM companies serve completely different models based on whether you have reasoning enabled or disabled for "Opus 4".