Just ran llama-bench at home with the similarly priced AMD AI PRO R9700 32G. The Phoronix numbers look extremely low. I'm probably misunderstanding their test bench. Anyway, here are some numbers. Maybe someone with access to a B70 can post a comparison.
"I've no idea why one would use gpt-oss-20b at Q8" - would you mind expanding on this comment?
In that particular model family, the choices are 20B and 120B, so 20B at a higher quant fits in VRAM, while with 120B you'd be settling for a lower quant. Is it that 20B MXFP4 is comparable in performance, so there's no need for Q8?
Or is the insight simply that there are better models available now and the emphasis is on gpt-oss-20b, not Q8?
The parameters in the original gpt-oss-20B model are "post-trained with MXFP4 quantization", so there just isn't much to gain by quantizing to Q8. If you look inside the Q8 model, most of the parameters are MXFP4 anyway.
Though, looking inside my "gpt-oss 20B MXFP4 MoE" model, it also appears to be quantized the same way as the Q8, so that was probably an overstatement on my part.
Still, the Q8 is 12.1 GB and the FP16 is 13.8 GB. Not the ~1:2 ratio you might expect.
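To put a number on "not the ~1:2 ratio", here is a rough sketch comparing the observed file-size ratio with the nominal one. It assumes the usual block layouts (Q8_0: 32 int8 weights plus one fp16 scale; MXFP4: 32 4-bit weights plus one 8-bit shared scale) and ignores metadata and any tensors kept in other formats, so treat it as back-of-the-envelope only.

```python
# Nominal bits per weight, derived from the block layouts assumed above.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": (32 * 8 + 16) / 32,   # 34 bytes per 32 weights
    "MXFP4": (32 * 4 + 8) / 32,   # 17 bytes per 32 weights
}

# File sizes quoted in the comment above, in GB.
Q8_FILE_GB = 12.1
FP16_FILE_GB = 13.8

observed = Q8_FILE_GB / FP16_FILE_GB
expected = BITS_PER_WEIGHT["Q8_0"] / BITS_PER_WEIGHT["FP16"]

print(f"observed Q8/FP16 file size ratio: {observed:.2f}")
print(f"nominal Q8_0/FP16 ratio:          {expected:.2f}")
print(f"nominal MXFP4 bits per weight:    {BITS_PER_WEIGHT['MXFP4']}")
```

A gap that large between the observed and nominal ratios is what you would expect if most of the weights are stored as MXFP4 in both files and only the remaining tensors differ.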
Tried to use the same model as the article:
llama-bench -m gpt-oss-20b-Q8_0.gguf -ngl 999 -p 2048 -n 128
AMD R9700 pp2048=3867 tg128=175
And a bigger model, because testing a tiny model with a 32GB card feels like a waste:
llama-bench -m Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128
AMD R9700 pp2048=917 tg128=22
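For anyone who prefers wall-clock time, a quick sketch that converts the numbers above into seconds. It assumes the pp2048 and tg128 figures are tokens per second, which is what llama-bench reports; the dictionary keys are just labels I picked.

```python
def seconds(tokens, tokens_per_second):
    """Time needed to process `tokens` at a given throughput."""
    return tokens / tokens_per_second

# Throughput figures from the two runs above (assumed to be tokens/second).
runs = {
    "gpt-oss-20b-Q8_0": {"pp2048": 3867, "tg128": 175},
    "Qwen3.6-27B-UD-Q6_K_XL": {"pp2048": 917, "tg128": 22},
}

times = {}
for name, r in runs.items():
    times[name] = {
        "prompt_s": seconds(2048, r["pp2048"]),
        "generation_s": seconds(128, r["tg128"]),
    }
    print(f"{name}: prompt {times[name]['prompt_s']:.2f}s, "
          f"generation {times[name]['generation_s']:.2f}s")
```

So the 27B model takes several times longer on both phases, which is roughly what you'd expect from a dense model of that size versus a small MoE.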