Hacker News new | ask | show | jobs
by evilduck 78 days ago
In terms of ability, maybe, in terms of speed, it's not even close. Check out the Prompt Processing speeds between them: https://kyuz0.github.io/amd-strix-halo-toolboxes/

gpt-oss-120b is over 600 tokens/s PP for all but one backend.

nemotron-3-super is at best 260 tokens/s PP.

Comparing token generation, it's again like 50 tokens/sec vs 15 tokens/sec

That really bogs down agentic tooling. Something needs to be categorically better to justify halving output speed, not just playing in the margins.

1 comments

In my case with vLLM on dual RTX Pro 6000

gpt-oss-120b: (unknown prefill), ~175 tok/s generation. I don't remember the prefill speed but it certainly was below 10k

Nemotron-3-Super: 14070 tok/s prefill, ~194.5 tok/s generation. (Tested fresh after reload, no caching, I have a screenshot.)

Nemotron-3-Super using NVFP4 and speculative decoding via MTP 5 tokens at a time as mentioned in Nvidia cookbook: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemo...

Hmm you might be able to tweak the settings further. Under llama.cpp on one RTX 6000 Pro I get ~215 tok/s generation speed. The key for me was setting min_p greater than 0. My settings:

``` #!/bin/bash

llama-server \ -hf ggml-org/gpt-oss-120b-GGUF \ -c 0 \ -np 1 \ --jinja \ --no-mmap \ --temp 1.0 \ --top-p 1.0 \ --min-p 0.001 \ --chat-template-kwargs '{"reasoning_effort": "high"}' \ --host 0.0.0.0 ```