by wizee 473 days ago
Ollama defaults to a context of 2048 regardless of model unless you override it with /set parameter num_ctx [your context length]. This is because long contexts make inference slower. In my experiments, QwQ tends to overthink and question itself a lot and generate massive chains of thought for even simple questions, so I'd recommend setting num_ctx to at least 32768.

In my experiments on a couple of mechanical engineering problems, it did fairly well on final answers, correctly solving problems that even DeepSeek r1 (full size) and GPT 4o got wrong in my tests. However, the chain of thought was absurdly long, convoluted, circular, and all over the place. This also made it very slow, maybe 30x slower than comparably sized non-thinking models.

I used a num_ctx of 32768, top_k of 30, temperature of 0.6, and top_p of 0.95. These parameters (other than context length) were recommended by the developers on Hugging Face.
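
If you're calling Ollama from a script instead of the REPL, the same settings can go in the per-request "options" object of the /api/generate endpoint. Here's a rough sketch: the model tag "qwq", the helper name build_generate_request, and the prompt are assumptions on my part, and it only builds the request body without sending anything.

```python
import json


def build_generate_request(prompt, model="qwq"):
    """Build a JSON body for a POST to Ollama's /api/generate endpoint.

    The model tag is an assumption; use whatever tag `ollama list` shows.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete response instead of chunks
        "options": {
            "num_ctx": 32768,    # context length, overriding the default
            "top_k": 30,
            "temperature": 0.6,
            "top_p": 0.95,
        },
    }


if __name__ == "__main__":
    # Placeholder prompt, purely for illustration.
    body = build_generate_request("Why is the sky blue?")
    print(json.dumps(body, indent=2))
```

You'd POST that body to your local Ollama server (by default it listens on port 11434). Options passed this way apply to that request only, so nothing has to be saved into the model.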


I always see:

  /set parameter num_ctx <value>
explained, but never the follow-up:

  /save <custom-name>
So you don't have to do the parameter change every load. Is there a better way or is it kind of like setting num_ctx in that "you're just supposed to know"?
You can also set

    OLLAMA_CONTEXT_LENGTH=<tokens>
as an environment variable to change ollama's default context length.
I think that will not work if you use the OpenAI compatible API endpoint.
I tried this with ollama run, and it had no effect at all.
That env variable is brand new. Did you update Ollama?
My understanding is that top_k and top_p are two different methods of decoding tokens during inference. top_k=30 considers the top 30 tokens when selecting the next token to generate and top_p=0.95 considers the top 95 percentile. You should only need to select one.

https://github.com/ollama/ollama/blob/main/docs/modelfile.md...

Edit: Looks like both work together. "Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)"

Not quite sure how this is implemented - maybe one is preferred over the other when there are enough interesting tokens!

They both work on a list of tokens sorted by probability. top_k selects a fixed number of tokens; top_p selects the top tokens until the sum of their probabilities reaches the threshold p. So for example, if the top 2 tokens have probabilities of 0.5 and 0.4, then a top_p of 0.9 would stop selecting there.

Both can be chained together, and some inference engines let you change the order of the token filtering, so you can do p before k, etc. (the same goes for all the other sampling parameters, like repetition penalty, removing top token, DRY, etc.). Each filtering step readjusts the probabilities so they always sum to 1.
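
To make that concrete, here's a toy sketch of the two filters over a plain dict of token probabilities. It isn't how Ollama or any particular engine implements sampling. The function names and the example distributions are made up for illustration, and the small tolerance in top_p_filter is my own addition to sidestep float rounding.

```python
def renormalize(probs):
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs, k):
    """Keep the k most probable tokens, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        # Small tolerance so float rounding doesn't pull in an extra token.
        if cumulative >= p - 1e-9:
            break
    return renormalize(kept)


if __name__ == "__main__":
    # The 0.5 / 0.4 example: top_p=0.9 stops after the first two tokens.
    probs = {"a": 0.5, "b": 0.4, "c": 0.06, "d": 0.04}
    print(sorted(top_p_filter(probs, 0.9)))

    # Order matters because each step renormalizes before the next one runs.
    probs2 = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
    print(sorted(top_p_filter(top_k_filter(probs2, 2), 0.5)))  # k, then p
    print(sorted(top_k_filter(top_p_filter(probs2, 0.5), 2)))  # p, then k
```

In the second distribution, k-then-p keeps only "a", because after top_k=2 renormalizes, "a" alone already holds more than half the mass. p-then-k keeps both "a" and "b".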