
by wizee 473 days ago

Ollama defaults to a context of 2048 regardless of model unless you override it with /set parameter num_ctx [your context length]. This is because long contexts make inference slower. In my experiments, QwQ tends to overthink and question itself a lot and generate massive chains of thought for even simple questions, so I'd recommend setting num_ctx to at least 32768.

In my tests on a couple of mechanical engineering problems, it did fairly well on final answers, correctly solving problems that even DeepSeek r1 (full size) and GPT 4o got wrong. However, the chain of thought was absurdly long, convoluted, circular, and all over the place. This also made it very slow, maybe 30x slower than comparably sized non-thinking models.

I used a num_ctx of 32768, top_k of 30, temperature of 0.6, and top_p of 0.95. These parameters (other than context length) were recommended by the developers on Hugging Face.
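For anyone calling Ollama over HTTP instead of the interactive prompt, here is a minimal sketch of passing the same settings per request through the `options` field of the native `/api/generate` endpoint. The model tag `qwq`, the default local port, and the helper names `build_generate_request` and `generate` are assumptions for illustration, not something from the post above.

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_ctx=32768):
    """Build a JSON body for Ollama's native /api/generate endpoint.

    The sampling values mirror the ones quoted in the post; the helper
    itself is hypothetical.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {
            "num_ctx": num_ctx,
            "top_k": 30,
            "temperature": 0.6,
            "top_p": 0.95,
        },
    }


def generate(payload, url="http://localhost:11434/api/generate"):
    """POST the payload to a locally running Ollama server (assumed URL)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


payload = build_generate_request("qwq", "How much does a 2 m steel beam sag?")
# With a server running and the model pulled: print(generate(payload))
```

Setting the options per request avoids relying on whatever default the server was started with.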


zamadatix 472 days ago

I always see:

  /set parameter num_ctx <value>

explained, but never the follow-up:

  /save <custom-name>

So you don't have to do the parameter change every load. Is there a better way or is it kind of like setting num_ctx in that "you're just supposed to know"?
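One alternative to `/save` is to bake the parameters into a Modelfile and build a named model from it. The sketch below only generates the Modelfile text; the base tag `qwq`, the derived name `qwq-32k`, and the helper `build_modelfile` are assumptions for illustration.

```python
def build_modelfile(base_model, params):
    """Return Modelfile text: one FROM line plus one PARAMETER line per setting."""
    lines = [f"FROM {base_model}"]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


modelfile = build_modelfile(
    "qwq",  # assumed base model tag
    {"num_ctx": 32768, "top_k": 30, "temperature": 0.6, "top_p": 0.95},
)
print(modelfile)
```

Saving the output as `Modelfile` and running `ollama create qwq-32k -f Modelfile` should then give a model that loads with these settings every time.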


sReinwald 472 days ago

You can also set

    OLLAMA_CONTEXT_LENGTH=<tokens>

as an environment variable to change ollama's default context length.


Tepix 472 days ago

I think that will not work if you use the OpenAI compatible API endpoint.


svachalek 472 days ago

I tried this with ollama run, and it had no effect at all.


underlines 472 days ago

That env parameter is brand new. Did you update ollama?


flutetornado 472 days ago

My understanding is that top_k and top_p are two different methods of choosing candidate tokens during inference. top_k=30 considers the 30 most likely tokens when selecting the next token to generate, and top_p=0.95 considers the tokens in the top 95% of the probability mass. You should only need to select one.

https://github.com/ollama/ollama/blob/main/docs/modelfile.md...

Edit: Looks like both work together. "Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)"

Not quite sure how this is implemented - maybe one is preferred over the other when there are enough interesting tokens!


nodja 472 days ago

They both work on a list of tokens sorted by probability. top_k keeps a fixed number of tokens; top_p keeps the top tokens until their cumulative probability reaches the threshold p. For example, if the top 2 tokens have probabilities of 0.5 and 0.4, a top_p of 0.9 would stop selecting there.

Both can be chained together, and some inference engines let you change the order of the token filtering, so you can do p before k, etc. (along with all the other sampling parameters, like repetition penalty, removing top token, DRY, etc.). Each filtering step readjusts the probabilities so they always sum to 1.
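The two filters described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not how Ollama or any particular engine implements it; the function names, the toy distribution, and the small tolerance on the threshold comparison are all assumptions.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, then renormalize to sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


def top_p_filter(probs, p):
    """Keep top tokens until their cumulative probability reaches p,
    then renormalize to sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        # Small tolerance so float rounding does not pull in an extra token.
        if cumulative >= p - 1e-12:
            break
    total = sum(q for _, q in kept)
    return {token: q / total for token, q in kept}


# Toy next-token distribution (made up for illustration).
probs = {"a": 0.5, "b": 0.4, "c": 0.06, "d": 0.04}

# The example from the comment: 0.5 + 0.4 reaches 0.9, so only "a" and "b" stay.
nucleus = top_p_filter(probs, 0.9)

# Chaining k before p: top_k drops "d", then top_p runs on the renormalized rest.
chained = top_p_filter(top_k_filter(probs, 3), 0.95)

print(sorted(nucleus), sorted(chained))
```

In the chained case, renormalizing after top_k leaves "a" and "b" at a combined 0.9375, which is below 0.95, so "c" is kept as well. That is why the order of the filters can change which tokens survive.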
