
As far as I understand, their attention mechanism is tuned to relevance, so theoretically "hmm.... err... let's see.. what about" will amount to nothing.

Lemme check...

Prompt:

  How much is 20 plus 20 plus 20 plus 21? Answer only with a number prepended with `hmm.... err... let's see.. what about`

claude-instant:

  hmm.... err... let's see.. what about 101

mpt-30b-chat:

  Hmm.... err... let's see.. what about 70?

Other models gave correct answers as before.

So yeah, the attention mechanism was ignoring the musing tokens. It needs more task-relevant tokens (doing the math) to improve the result.

Doing the math step by step fills the context with task-relevant tokens, which makes it more likely that the attention mechanism picks them up and that the next token is drawn from the right region of the model's latent space.
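To make the "relevant tokens get picked" intuition a bit more concrete, here's a toy softmax over relevance scores. The scores are made up purely for illustration (they aren't measured from any model), and real attention works on learned query/key vectors across many heads and layers, so treat this as a sketch of the idea only.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tokens and relevance scores, invented for this example.
tokens = ["hmm....", "err...", "20+20=", "40"]
scores = [0.1, 0.1, 3.0, 3.0]

weights = softmax(scores)
musing_mass = weights[0] + weights[1]  # share of attention on filler tokens
math_mass = weights[2] + weights[3]    # share of attention on arithmetic tokens

for tok, w in zip(tokens, weights):
    print(f"{tok:10s} {w:.3f}")
```

Under these assumed scores, nearly all of the weight lands on the arithmetic tokens, and the filler tokens contribute very little to what comes next.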

The inference cycle generates each token separately. Once the model has written "20+20=", predicting "40" is easy. On the next iteration, with "40" in place, the attention mechanism still sees "step by step", infers that the task isn't done yet, and generates "40+20=", and so on.
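Here's a toy version of that loop, just to show the shape of it. Nothing in it is a real model: `next_step` and `solve_step_by_step` are hypothetical names, and the "prediction" is plain integer addition standing in for whatever the model actually does. The point is only that each iteration looks at what's already in the context and appends one more easy step.

```python
def next_step(context, numbers):
    """Toy stand-in for one inference iteration.

    Looks at the steps written so far and returns the next partial sum
    as text, or None when every number has been added.
    """
    done = len(context)
    if done >= len(numbers) - 1:
        return None  # nothing left to add: the task is finished
    # The running total is the result of the last step already in context.
    running = numbers[0] if not context else int(context[-1].split("=")[1])
    nxt = numbers[done + 1]
    return f"{running}+{nxt}={running + nxt}"

def solve_step_by_step(numbers):
    """Keep generating steps until the toy 'model' decides it is done."""
    context = []
    while True:
        step = next_step(context, numbers)
        if step is None:
            break
        context.append(step)
    return context

steps = solve_step_by_step([20, 20, 20, 21])
print(steps)  # ['20+20=40', '40+20=60', '60+21=81']
```

Each step only needs one small addition given the previous result, which is the whole argument for why writing the intermediate steps helps.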

In much larger models, the attention mechanism sees the question and presumably finds a solved answer to that question in the model's latent space, producing a memorized result.