The closest explanation of how chain of thought works is that it suppresses the probability of a termination token.
People have found that even letting LLMs generate gibberish tokens produces better final outputs, which isn't a surprise when you realise that the only way an LLM can do computation is by outputting tokens.
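To make the "suppressing the termination token" idea concrete, here is a minimal sketch of the mechanism at the decoding level. It assumes a toy three-token vocabulary with made-up logits, and the names `suppress_eos`, `eos_id`, and `min_steps` are hypothetical. It only shows how masking the end-of-sequence logit forces generation to continue, not how any particular model implements chain of thought.

```python
import math

def suppress_eos(logits, eos_id, step, min_steps):
    """Return a copy of logits with the end-of-sequence logit set to -inf
    while step < min_steps, so the model cannot stop early.
    All names here are illustrative assumptions."""
    out = list(logits)
    if step < min_steps:
        out[eos_id] = -math.inf
    return out

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits where token 2 (assumed to be end-of-sequence) is the favourite.
toy_logits = [1.0, 2.0, 3.0]

# Early in generation: the end-of-sequence token is masked out entirely.
early = softmax(suppress_eos(toy_logits, eos_id=2, step=0, min_steps=5))

# Once min_steps is reached, the logits pass through unchanged.
late = softmax(suppress_eos(toy_logits, eos_id=2, step=5, min_steps=5))
```

While the mask is active, all of the probability mass goes to non-terminating tokens, so the model has to keep emitting tokens (and so keep computing) before it is allowed to finish.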
Unless you are building one of the frontier models, I’m not sure that your experience gives you insight into those models. Perhaps it just creates needless assumptions.