|
|
|
|
|
by stygiansonic
816 days ago
|
|
A simplified explanation, which I think I heard from Karpathy, is that transformer models only do computation when they generate (decode) a token. So generating more tokens (using CoT) gives the model more time to “think”. Obviously this doesn’t capture all the nuance. |
|
There's simply a much larger space of possibilities for shorter completions, A B1, A B2, etc. that are plausible. Like if I ask you to give a short reply to a nuanced question, you could reply with a thoughtful answer, a plausible superficially correct sounding answer, convincing BS, etc.
Whereas if you force someone to explain their reasoning, the space of plausible completions reduces. If you start with convincing BS and work through it honestly, you will conclude that you should reverse. (This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply through openly asking them to play out the consequences and walking through the impact of stuff that sounds good without much thought.)
This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering.