Hacker News new | ask | show | jobs
by lostmsu 5 days ago
Yeah, that's what talking out of your butt is literally. "theoretical", no ballparks, ignorant assumptions about expert reuse.
1 comments

There's been some very rough experiments with batching on Apple Silicon (and that's not a highly suitable platform since the compute/thermals bottleneck hits sooner than elsewhere) that seem to be broadly consistent with what I argued, showing as much as 2x total decode throughput with an 8-wide batch. That's substantial in this context.
Assuming you magically use all 128GiB of xRAM you need to read ~32GiB per token in batched mode. On a good SSD that would be 1/3 tokens per second. Cool, 2x that you can do 2/3 tokens per second. Let's assume you are lucky and can actually do 6/7 tokens per second. That's still an extremely far cry from 20+ tokens per second of 27B before any batching.
I don't understand where your numbers are coming from. Why is there a 20x (40x?) slowdown of tps after batching?
Before batching. The slowdown is because the model does not fit the xRAM so experts will have to be read from SSD on every forward pass. That's why it is impractically slow.

Batching could allow you to generate 10 tokens for 10 different conversations at the time, but it also means that you need to load different experts for different tokens, so it does not help as much as it does for dense models.

But IIUC the point is that each expert gets used for more than just the one token. So yes, the tps of a given thread takes a hit because now you're sometimes going to schedule in unrelated experts and it will have to pause. But overall you're utilizing the hardware much more efficiently and so in aggregate there's a speedup.

On top of that (as previously pointed out by zoz) for a single user running a single overarching task the choice of experts is expected to be highly biased.

> the choice of experts is expected to be highly biased

Why? Why do you think that's the case? Part of the training is balancing load between experts.

> so in aggregate there's a speedup.

Yes. 2x. Over theoretical under 1 tok/s