|
|
|
|
|
by ggerganov
247 days ago
|
|
FYI you should have used llama.cpp to do the benchmarks. It performs almost 20x faster than ollama for the gpt-oss-120b model. Here are some samples results on my spark: ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 3564.31 ± 9.91 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 53.93 ± 1.71 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp4096 | 1792.32 ± 34.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 38.54 ± 3.10 |
|
|
Example looking at blk.0.attn_k.weight, it's q8_0 amongst other layers:
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...
Example looking at the same weight on Ollama is BF16:
https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360