Hacker News
by svnt 43 days ago
Look at any recent CoT output where the model is trying to infer from an underspecified prompt what the user wants or means.

It is generally the first thing they do: try to figure out what you meant by the prompt. When they can’t infer your intent, good models ask follow-on questions to clarify.

I am wondering if this is a semantics issue, as this is an established area of research, e.g. https://arxiv.org/pdf/2501.10871

Right, and then look at any number of research papers showing that CoT output has limited impact on the end result. We've trained these models to pretend to reason.

If it's only pretending to reason, then how is it that the CoT output improves performance on every single benchmark/test?

> Right, and then look at any number of research papers showing that CoT output has limited impact on the end result.

Which research papers? Do I have to find them?

> We've trained these models to pretend to reason.

I have no idea why that matters. Can you tell me what the difference is if it looks exactly the same and has the same result?

Examples:

https://arxiv.org/html/2506.02878v1

https://arxiv.org/pdf/2508.01191

Anthropic themselves: https://www.anthropic.com/research/reasoning-models-dont-say...

They were approaching this from an interpretability standpoint, but the more interesting finding in there is that models come up with an answer that fits their training and context provided. CoT is generated to fit the anticipated answer.

In these studies, there are examples of CoT that directly contradicts the response these models ultimately settle on.

This is not reasoning. This is pretense.

The first sentence of the first paper you linked:

"Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models (LLMs) on tasks requiring multi-step inference."

I think it would be helpful if you clarified what exactly you mean because it appears your evidence contradicts your argument.

If you read these further, researchers believe this effect does exist, but only insofar as priming the model for the answer it was likely to give anyway and only when queries are in-distribution. If there were actual reasoning involved rather than pattern matching, we would expect to see performance improvements on out-of-distribution requests. Instead we see longer CoT actually degrade performance on out-of-distribution tasks.

The fact that common sense, simple logical questions (like should you drive or walk to the car wash) cannot be answered by LLMs, despite CoT, simply because they don't appear often enough in pre- or post-training datasets is just another indicator that they are not performing what we would call reasoning, intent inference, or whatever other anthropomorphic behavior we want to assign them. They remain spicy autocomplete, with the caveat that the RLHF portion of their training _can_ result in goal-seeking and problem-solving behavior... in the narrow set of problems that have been explicitly optimized for in their training.

> If you read these further, researchers believe this effect does exist, but only insofar as priming the model for the answer it was likely to give anyway and only when queries are in-distribution.

'Demonstrably' means one thing. They said it demonstrably improves outputs. If they want to hedge that with theories about why you would get the same result without it, then they need to remove that word or come up with a coherent thesis. Otherwise, I am misunderstanding what you are trying to argue.

> The fact that common sense, simple logical questions (like should you drive or walk to the car wash) cannot be answered by LLMs

These are trick questions designed to fool LLMs. It is like saying that people cannot visualize because optical illusions exist, or people don't understand the laws of physics because they fall for magic tricks. It is a failure mode in the way they operate but it doesn't say anything about their operation besides that they fail in that mode for specific reasons.

> They remain spicy autocomplete

And nuclear power plants remain spicy steam generators, but that says nothing actually useful nor offers any insight. Reducing something to its basic mechanism in order to dismiss its output is lazy and thought-terminating.

This is just a no-true-Scotsman defense of reasoning. We were talking about inferring intent.

If someone recorded the inner monologue of human decision-making, would it look like a logician’s workbook? No, I don’t think it would. People like to pretend they are rational.

When they say "pretends to" here they're talking about something quantifiable: that the extra text it outputs for CoT barely feeds back into the decision-making at all. In other words, it's about as useful as having the LLM make the decision and then "explain" how it got there; the extra output is confabulation.

Though I'm not sure how true that claim is...
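
For what it's worth, here is a rough sketch of how you could put a number on "barely feeds back": cut the CoT off early at every step and count how often the final answer changes. This is only an illustration under stated assumptions, not the method of any of the linked papers. `answer_fn` is a hypothetical wrapper around whatever model you are testing, and the two toy "models" at the bottom are made up to show the two extremes.

```python
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not a real library API): takes a
# question plus a possibly truncated list of CoT steps and returns the
# final answer as a string.
AnswerFn = Callable[[str, Sequence[str]], str]


def cot_dependence(answer_fn: AnswerFn, question: str,
                   cot_steps: Sequence[str]) -> float:
    """Fraction of truncated CoT prefixes whose answer differs from the
    answer given with the full CoT.

    Near 0.0: the answer is the same no matter how much CoT is kept,
    i.e. the CoT looks like after-the-fact explanation.
    Near 1.0: the answer depends on having the whole CoT.
    """
    if not cot_steps:
        return 0.0
    full_answer = answer_fn(question, cot_steps)
    changed = 0
    # Try every strict prefix: 0 steps, 1 step, ..., n-1 steps.
    for k in range(len(cot_steps)):
        if answer_fn(question, cot_steps[:k]) != full_answer:
            changed += 1
    return changed / len(cot_steps)


# Toy stand-ins for a model, purely illustrative.
def ignores_cot(question: str, steps: Sequence[str]) -> str:
    # Always gives the same answer; the CoT is decoration.
    return "walk"


def uses_cot(question: str, steps: Sequence[str]) -> str:
    # Answer is whatever the last available step concluded.
    return steps[-1] if steps else "unknown"


steps = ["the car is at home", "the car wash washes the car", "drive"]
score_ignoring = cot_dependence(ignores_cot, "drive or walk?", steps)
score_using = cot_dependence(uses_cot, "drive or walk?", steps)
```

With the toy functions, `ignores_cot` scores 0.0 and `uses_cot` scores 1.0. A real model would land somewhere in between, and where it lands is exactly the thing being argued about upthread.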

You make a good point. I had the impression they were using 'pretend' as a Chinese Room shortcut in that they are asserting that it is incapable of reasoning and only appears to be capable from the outside, which is completely irrelevant and unfalsifiable.