| HN Mirror


by Eisenstein 45 days ago

The first sentence of the first paper you linked:

"Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models (LLMs) on tasks requiring multi-step inference."

I think it would be helpful if you clarified what exactly you mean because it appears your evidence contradicts your argument.


batshit_beaver 45 days ago

If you read these further, researchers believe this effect does exist, but only insofar as priming the model for the answer it was likely to give anyway and only when queries are in-distribution. If there were actual reasoning involved rather than pattern matching, we would expect to see performance improvements on out-of-distribution requests. Instead, we see longer CoT actually degrade performance on out-of-distribution tasks.

The fact that common sense, simple logical questions (like should you drive or walk to the car wash) cannot be answered by LLMs, even with CoT, simply because they don't appear often enough in pre- or post-training datasets is just another indicator that they are not performing what we would call reasoning, intent inference, or whatever other anthropomorphic behavior we want to assign to them. They remain spicy autocomplete, with the caveat that the RLHF portion of their training _can_ result in goal-seeking and problem-solving behavior... in the narrow set of problems that have been explicitly optimized for in their training.

Eisenstein 45 days ago

> If you read these further, researchers believe this effect does exist, but only insofar as priming the model for the answer it was likely to give anyway and only when queries are in-distribution.

'Demonstrably' means one thing. They said it demonstrably improves outputs. If they want to hedge that with theories about why the same result would occur without it, then they need to remove that word or come up with a coherent thesis. Either that, or I am misunderstanding what you are trying to argue.

> The fact that common sense, simple logical questions (like should you drive or walk to the car wash) cannot be answered by LLMs

These are trick questions designed to fool LLMs. It is like saying that people cannot visualize because optical illusions exist, or that people don't understand the laws of physics because they fall for magic tricks. It is a failure mode in the way they operate, but it says nothing about their operation beyond the fact that they fail in that mode for specific reasons.

> They remain spicy autocomplete

And nuclear power plants remain spicy steam generators, but that says nothing actually useful nor offers any insight. Reducing something to its basic mechanism in order to dismiss its output is lazy and thought-terminating.