HN Mirror


by EnPissant 248 days ago

Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much, much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench and benchmark with that instead of ollama.

Running gpt-oss-120b with an RTX 5090 and 2/3 of the experts offloaded to system RAM (which has less than half the memory bandwidth of the Spark), my machine gets ~4100 tps prefill and ~40 tps decode.

Your spreadsheet shows the Spark getting ~94 tps prefill and ~11 tps decode.

Now, it's expected that my machine should slaughter the Spark in prefill, but decode is memory-bandwidth bound, so decode speed should be very similar, or the Spark a touch faster.
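To make the decode argument concrete, here is a rough sketch of a bandwidth-only upper bound on decode speed: each generated token has to read the active weights once, so time per token is at least bytes read divided by memory bandwidth, summed over wherever the weights live. The 3 GB of active weights per token, the 1/3 vs 2/3 split, and the bandwidth figures (1792 GB/s VRAM, 96 GB/s system RAM, 256 GB/s unified) are illustrative assumptions, not measurements from this thread, and the function name is hypothetical.

```python
def decode_tps_upper_bound(segments):
    """Bandwidth-only ceiling on decode tokens/s.

    segments: list of (bytes_read_per_token, bandwidth_bytes_per_s),
    one entry per memory pool the active weights are read from.
    Ignores compute, KV-cache reads, and PCIe transfers.
    """
    seconds_per_token = sum(nbytes / bw for nbytes, bw in segments)
    return 1.0 / seconds_per_token


GB = 1e9
active_bytes = 3 * GB  # ASSUMPTION: hypothetical active weights read per token

# ASSUMPTION: all weights in one 256 GB/s unified memory pool.
unified = decode_tps_upper_bound([(active_bytes, 256 * GB)])

# ASSUMPTION: 1/3 of the reads hit 1792 GB/s VRAM, 2/3 hit 96 GB/s system RAM.
split = decode_tps_upper_bound([
    (active_bytes / 3, 1792 * GB),
    (active_bytes * 2 / 3, 96 * GB),
])

print(f"unified: {unified:.1f} tps, split: {split:.1f} tps")
```

Under these assumed numbers the two ceilings land within a factor of two of each other, with the unified pool ahead, which is the shape of the claim above. Real decode speed will sit below both bounds.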

2 comments

hnuser123456 248 days ago

Your system RAM is probably 1/20th the VRAM bandwidth of the 5090 (way, way less than half) unless you're running a workstation board with quad- or 8-channel RAM, in which case it's only about 1/10th or 1/5th respectively.


EnPissant 248 days ago

I'm saying it's less than half the bandwidth of this DGX Spark: dual-channel DDR5-6000 vs quad-channel LPDDR5-8000.
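As a quick sanity check on that ratio, theoretical peak bandwidth is the transfer rate times the bus width in bytes. The sketch below assumes a 128-bit total bus for dual-channel DDR5 and a 256-bit total bus for the Spark's LPDDR5; neither width is stated in the thread, and the function name is made up for illustration.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: MT/s * bytes per transfer / 1000."""
    return transfer_rate_mts * (bus_width_bits / 8) / 1000


# ASSUMPTION: dual-channel DDR5-6000 modeled as a 128-bit total bus.
desktop = peak_bandwidth_gbs(6000, 128)

# ASSUMPTION: "quad channel LPDDR5-8000" modeled as a 256-bit total bus.
spark = peak_bandwidth_gbs(8000, 256)

ratio = desktop / spark
print(f"desktop: {desktop} GB/s, spark: {spark} GB/s, ratio: {ratio}")
```

With these assumed widths the desktop comes out at 96 GB/s against 256 GB/s, a ratio of 0.375, which matches "less than half".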


yvbbrjdr 248 days ago

We actually profiled one of the models and saw that the last GEMM, which is completely memory bound, takes a lot of time, which significantly reduces the token generation speed.


lostmsu 248 days ago

The parent is right, the issue is on your side.
