by krackers
335 days ago
Except generating more tokens also effectively extends the computational power beyond the depth of the circuit, which is why chain of thought works in the first place. Even sampling only dummy tokens that don't convey anything still provides more computational power.
|
It's been shown that this accidental computation actually helps CoT models, but they're not supposed to work like that: they're supposed to generate logical observations and use those observations to work toward the goal (and that is primarily what they do).
Filler tokens occupy context space and are less useful than meaningful tokens, so if you want a model that maximizes useful results per unit of compute, you'd want a terse context window without any fluff.
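
To make the distinction in this exchange concrete, here is a rough back-of-the-envelope sketch. It assumes a plain decoder-only transformer and counts two things: the total number of (layer, position) evaluations, and the longest chain of evaluations that depend on each other. The names `ComputeBudget`, `budget`, and `n_extra` are made up for illustration, and the model ignores everything except layer count and token count.

```python
from dataclasses import dataclass


@dataclass
class ComputeBudget:
    layer_applications: int  # total (layer, position) evaluations
    serial_depth: int        # longest chain of dependent layer evaluations


def budget(n_layers: int, prompt_len: int, n_extra: int, sampled: bool) -> ComputeBudget:
    """Rough compute available before the answer token is read out.

    n_extra: tokens appended after the prompt and before the answer.
    sampled: True if those tokens are the model's own outputs (chain of
             thought), False if they are fixed filler chosen in advance.
    """
    positions = prompt_len + n_extra
    total = n_layers * positions
    if sampled:
        # Each sampled token is produced by a full pass through the stack
        # and then fed back in, so the passes chain one after another.
        depth = n_layers * (n_extra + 1)
    else:
        # Filler tokens do not depend on model output, so they only add
        # positions (width); the dependency chain stays one stack deep.
        depth = n_layers
    return ComputeBudget(total, depth)


if __name__ == "__main__":
    print(budget(n_layers=12, prompt_len=10, n_extra=0, sampled=True))
    print(budget(n_layers=12, prompt_len=10, n_extra=5, sampled=False))
    print(budget(n_layers=12, prompt_len=10, n_extra=5, sampled=True))
```

Under these assumptions, filler and meaningful tokens cost the same number of layer evaluations, but only sampled tokens lengthen the dependency chain, which is one way to read both comments: filler adds some compute, while tokens that carry information add more per unit of context.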