
I mean, generating more tokens means you use more computing power, and there's some evidence that not all of these filler words go to waste (esp. since they are not really words, but vectors that can carry latent meaning), as models tend to become smarter when allowed to generate a lot of hemming and hawing.

It's been shown that this accidental computation actually helps CoT models, but they're not supposed to work like that: they're supposed to generate logical observations and use those observations to work further towards the goal (and that is primarily what they do).

Considering filler tokens occupy context space and are less useful than meaningful tokens, if you want a model that maximizes useful results per unit of compute, you'd want a terse context window without any fluff.