| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by embedding-shape 121 days ago
	Have you tried GPT-OSS-120b MXFP4 with reasoning effort set to high? Out of all models I can run within 96GB, it seems to consistently give better results. What exact llama model (+ quant I suppose) is it that you've had better results against, and what did you compare it against, the 120b or 20b variant?

1 comments

magic_hamster 121 days ago

How are you running this? I've had issues with Opencode formulating bad messages when the model runs on llama.cpp. Jinja threw a bunch of errors and GPT-OSS couldn't make tool calls. There's an issue for this on Opencode's repo but seems like it's been waiting or a couple of weeks.

> What exact llama model (+ quant I suppose) is it that you've had better results against

Not llama, but Qwen3-coder-next is on top of my list right now. Q8_K_XL. It's incredible (not just for coding).

link

embedding-shape 121 days ago

Again, you're not specifying what GPT-OSS you're talking about, there are two versions, 20b and 120b. Not to mention if you have a consumer GPU, you're most likely running it with additional quantization too, but you're not saying what version.

> Jinja threw a bunch of errors and GPT-OSS couldn't make tool calls.

This was an issue for a week or two when GPT-OSS initially launched, as none of the inference engines had properly implemented support for it, especially around tool calling. I'm running GPT-OSS-120b MXFP4 with LM Studio and directly with llama.cpp, the recent versions handle it well and I have no errors.

However, when I've tried either 120b or 20b with additional quantization (not the "native" MXFP4 ones), I've seen that they're having troubles with the tool syntax too.

> Not llama

What does your original comment mean then? You said llama was "strictly" better than GPT-OSS, which specific model variant are you talking about or you miswrote somehow?

link

vkaufmann 121 days ago

GPT-OSS-120B runs like hell on my DGX Spark

link

embedding-shape 121 days ago

The MXFP4 variant I suppose? My setup (RTX Pro 6000) does around ~140 tok/s with llama.cpp, around 160 tok/s with vLLM.

link

vkaufmann 121 days ago

yep MXFP4 really fast :D

link