| Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply. EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't believe I was near 130k -- using ollama with fp16 model). I asked one of my test questions with a coding question totally unrelated to what it says: <QwQ output>
But the problem is in this question. Wait perhaps I'm getting ahead of
myself. Wait the user hasn't actually provided a specific task yet. Let me check
again. The initial instruction says: "Please act as an AI agent that can perform tasks... When responding,
first output a YAML data structure with your proposed action, then wait
for feedback before proceeding." But perhaps this is part of a system prompt? Wait the user input here
seems to be just "You will be given a problem. Please reason step by
step..." followed by a possible task?
</QwQ> Note: Ollama "/show info" shows that the context size set is correct. |
That's not what Ollama's `/show info` is telling you. It actually just means that the model is capable of processing the context size displayed.
Ollama's behavior around context length is very misleading. There is a default context length limit parameter unrelated to the model's capacity, and I believe that default is a mere 2,048 tokens. Worse, when the prompt exceeds it, there is no error -- Ollama just silently truncates it!
If you want to use the model's full context window, you'll have to execute `/set parameter num_ctx 131072` in Ollama chat mode, or if using the API or an app that uses the API, set the `num_ctx` parameter in your API request.