More tokens = more useful compute towards making a prediction. A query with more tokens before the question is literally giving the LLM more "thinking time".
It correlates but the intuition is a bit misleading. What's actually happening is that by asking a model to generate more tokens, it increases the amount of information it has learnt to be present in its context block.
It's why "RAG" techniques work: the models learn during training to make use of information in context.
At the core of self-attention is a dot-product similarity measurement, which causes the model to act like a search engine.
It's helpful to think about it in terms of search: the shape of the outputs looks like conversation, but we're actually prompting the model to surface information from the QKV matrices internally.
Does it feel familiar? When we brainstorm, we usually chart graphs of related concepts, e.g. blueberry -> pie -> apple.
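To make the search analogy concrete, here is a rough sketch of single-query scaled dot-product attention in plain Python. The 2-d "embeddings" and the `attend` helper are made up for illustration; a real model learns its Q/K/V projections and works with far larger dimensions, so treat this as a toy, not as how any particular LLM is implemented.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Hypothetical helper: score each key against the query with a
    # dot product, scaled by sqrt(d) as in standard attention.
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Softmax turns the scores into "relevance" weights that sum to 1.
    weights = softmax(scores)
    # The output is a weighted blend of the values: a soft lookup.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy data (assumed, not from any real model): three context tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)  # the key most aligned with the query gets the largest weight
print(output)
```

Keys that point in the same direction as the query get more weight, which is the "search engine" behaviour: the query ranks everything in context by similarity and pulls back a blend of the best matches.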
>What's actually happening is that by asking a model to generate more tokens, it increases the amount of information it has learnt to be present in its context block.
I'm not saying this isn't part of it but even if it's just dummy tokens without any new information, it works.