They were approaching this from an interpretability standpoint, but the more interesting finding in there is that models come up with an answer that fits their training and context provided. CoT is generated to fit the anticipated answer.
In these studies, there are examples of CoT that directly contradicts the response these models ultimately settle on.
If you read these further, researchers believe this effect does exist, but only insofar as priming the model for the answer it was likely to give anyway and only when queries are in-distribution. If there was actual reasoning involved rather than pattern matching, we would expect to see performance improvements on out of distribution requests. Instead we see longer CoT actually degrade performance on out of distribution tasks.
The fact that common sense, simple logical questions (like should you drive or walk to the car wash) cannot be answered by LLMs simply because they don't appear often enough within pre- or post-training datasets despite CoT is just another indicator of them not performing what we would call reasoning or intent inference or whatever other anthropomorphic behavior we want to assign them. They remain spicy autocomplete with the caveat that the RLHF portion of their training _can_ result in goal seeking and problem-solving behavior... in the narrow set of problems that have been explicitly optimized for in their training.
> If you read these further, researchers believe this effect does exist, but only insofar as priming the model for the answer it was likely to give anyway and only when queries are in-distribution.
'Demonstrably' means one thing. They said it demonstrably improves outputs. If they want to hedge that with theories about why it would result in the same thing without it then they need to remove that word or come up with a coherent thesis, or I am misunderstanding what you are trying to argue.
> The fact that common sense, simple logical questions (like should you drive or walk to the car wash) cannot be answered by LLMs
These are trick questions designed to fool LLMs. It is like saying that people cannot visualize because optical illusions exist, or people don't understand the laws of physics because they fall for magic tricks. It is a failure mode in the way they operate but it doesn't say anything about their operation besides that they fail in that mode for specific reasons.
> They remain spicy autocomplete
And nuclear power plants remain spicy steam generators, but that says nothing actually useful nor offers any insight. Reducing something to its basic mechanism in order to dismiss its output is lazy and thought-terminating.
This is just a no-true-Scotsman defense of reasoning. We were talking about inferring intent.
If someone recorded the inner monologue of human decision-making, would it look like a logician’s workbook? No, I don’t think it would. People like to pretend they are rational.
"Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models (LLMs) on tasks requiring multi-step inference."
I think it would be helpful if you clarified what exactly you mean because it appears your evidence contradicts your argument.