by parsimo2010 55 days ago
> being able to run a model and being able to run a model fast are two very different thresholds

Concretely, on my Strix Halo machine with a (theoretical) memory bandwidth of 256 GB/s, a dense 70 GB model can't generate faster than 256 / 70 ≈ 3.66 t/s. The logic is that a dense model must read all of its weights once for every token it generates. So even if the GPU can keep up on compute, memory bandwidth is the limit.
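The bound above is just a division, but writing it out makes the assumptions explicit. This is a rough sketch, not a benchmark: the function name `max_decode_tps` is made up for illustration, and it assumes every weight byte is read exactly once per token while ignoring KV-cache traffic, compute time, and the gap between theoretical and achieved bandwidth.

```python
def max_decode_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s for a dense model when decoding is memory-bound.

    Assumes all weights are read once per generated token. Real throughput
    will be lower because achieved bandwidth is below the theoretical figure
    and the KV cache also has to be read.
    """
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Bandwidth figures are the ones quoted in the comment above.
    for name, bandwidth in [("Strix Halo", 256.0), ("Mac M5 Pro", 307.0)]:
        bound = max_decode_tps(bandwidth, 70.0)
        print(f"{name}: at most {bound:.2f} t/s for 70 GB of weights")
```

The same function also shows why a mixture-of-experts model can be much faster on the same machine: only the active parameters have to be read per token, so the effective `weights_gb` is far smaller than the full model size.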

A Mac M5 Pro has more bandwidth at 307 GB/s, but by the same math that only gets you to about 4.4 t/s.

This thing is going to be slow on consumer hardware. Maybe that is useful for someone, but in most cases I'd prefer a faster model even if it isn't quite as smart. Qwen3.6 35B-A3B generates about 50 t/s on my machine, so it can make a mistake, be corrected, and try again in the time this model would still be thinking about its first response.


Recent models support multi-token prediction (MTP), which guesses several future tokens in a single decode step (using a subset of the model itself, not a separate draft model) and then verifies them all at once. It's still an emerging feature (not widely supported) and it only helps on highly predictable token runs, but it's one way to do better in practice than the common-sense theoretical limit might suggest.
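To put a rough number on how much this can help, here is a back-of-the-envelope model borrowed from the usual speculative-decoding analysis. It is a simplification, not how any particular runtime implements MTP: it assumes each drafted token is accepted independently with the same probability, that acceptance stops at the first rejection, that each verify pass always yields at least one token, and that drafting and verifying cost nothing beyond one full read of the weights. The names `expected_tokens_per_step` and `mtp_decode_tps` are hypothetical.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verify pass under an i.i.d. acceptance model.

    One token is always produced by the pass itself; each of the draft_len
    drafted tokens counts only if every drafted token before it was accepted.
    This is the geometric sum 1 + p + p^2 + ... + p^draft_len.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def mtp_decode_tps(bandwidth_gb_s: float, weights_gb: float,
                   accept_prob: float, draft_len: int) -> float:
    """Memory-bound decode ceiling scaled by expected tokens per weight read.

    Ignores the extra cost of the prediction heads and of verifying several
    positions in one pass, so this is an optimistic estimate.
    """
    passes_per_second = bandwidth_gb_s / weights_gb
    return passes_per_second * expected_tokens_per_step(accept_prob, draft_len)


if __name__ == "__main__":
    # Illustrative acceptance rates only; real rates depend on model and text.
    for p in (0.5, 0.7, 0.9):
        tps = mtp_decode_tps(256.0, 70.0, accept_prob=p, draft_len=3)
        print(f"accept_prob={p}: about {tps:.1f} t/s")
```

With an acceptance rate of zero the estimate falls back to the plain bandwidth bound, which matches the point that MTP only pays off on predictable runs.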
It seems to me it's only Grok 4.20 that does this currently? Which other models did you have in mind, if I may ask?
Gemma 4, Qwen3.6, DeepSeek V4, MiMo, and GLM 5/5.1 all do MTP.
Thank you, I just realised we are talking about MTP. It seems that it's not that clear though. "Currently, the MTP capabilities are primarily accessible through Google's proprietary LiteRT framework, rather than the open-weights versions... Despite the missing MTP heads in the open release, Gemma 4 (specifically the 26B-A4B variant) still demonstrates high efficiency"
If Mistral Medium 3.5 supports it, that might get it to 10 t/s. It will still be fairly slow.