by EnPissant
248 days ago
Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much, much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench instead of ollama. Running gpt-oss-120b on an RTX 5090 with 2/3 of the experts offloaded to system RAM (which has less than half the memory bandwidth of the Spark), my machine gets ~4100 tps prefill and ~40 tps decode. Your spreadsheet shows the Spark getting ~94 tps prefill and ~11 tps decode. It's expected that my machine would slaughter the Spark in prefill, but decode should be very similar, or the Spark a touch faster.
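To put rough numbers on that gap, here is a quick sketch using only the four throughput figures quoted above. The dictionary and function names are made up for illustration, and the figures are the approximate ones from the comment, not fresh measurements.

```python
# Approximate throughput figures quoted in the comment (tokens per second).
# Names here are hypothetical; only the four numbers come from the thread.
rtx5090 = {"prefill": 4100, "decode": 40}  # RTX 5090, 2/3 of experts in system RAM
spark = {"prefill": 94, "decode": 11}      # values reported in the spreadsheet


def speedup(phase):
    """Return how many times faster the RTX 5090 box is for a given phase."""
    return rtx5090[phase] / spark[phase]


for phase in ("prefill", "decode"):
    print(f"{phase}: {speedup(phase):.1f}x")
# prints:
# prefill: 43.6x
# decode: 3.6x
```

A large prefill gap is expected, but a decode gap of roughly 3.6x is what looks off if decode should be about even.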