gpt-oss 120B - 37 tok/sec (with CPU offloading, doesn't fit in the GPU entirely)
Qwen3 32B - 65 tok/sec
Qwen3 30B-A3B - 150 tok/sec
(all at 4-bit)
Which model, inference software and hardware are you running it on?
The 30B-A3B variant flies on any GPU.
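A rough intuition for why the MoE model is so much faster: token-by-token decoding is usually limited by how many weight bytes have to be read from memory per generated token. A dense model reads all of its weights for every token. An MoE model reads only the active experts. The sketch below assumes approximate public figures (about 32.8B parameters for dense Qwen3 32B, about 3.3B active per token for Qwen3 30B-A3B) and ignores KV cache, activations and runtime overhead, so treat it as a back-of-envelope estimate, not a speed prediction.

```python
# Back-of-envelope: weight bytes read per generated token at 4-bit.
# Assumption: decode speed is roughly bounded by memory bandwidth,
# so fewer bytes read per token means more tokens/sec.
BITS_PER_WEIGHT = 4

# Approximate parameters (in billions) touched per token (assumed figures).
active_params_b = {
    "Qwen3 32B (dense)": 32.8,
    "Qwen3 30B-A3B (MoE, ~3.3B active)": 3.3,
}

def gb_per_token(params_b: float, bits: int = BITS_PER_WEIGHT) -> float:
    """Weight data read per token in GB (ignores KV cache and activations)."""
    return params_b * bits / 8

for name, params in active_params_b.items():
    print(f"{name}: ~{gb_per_token(params):.2f} GB of weights per token")
```

The roughly 10x gap in bytes per token is far larger than the measured gap (150 vs 65 tok/sec), which suggests other costs such as attention, KV cache reads and kernel overhead also matter in practice.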