| HN Mirror



	by lostmsu 10 days ago
	Sounds like you're talking out of your butt instead of doing the math.


zozbot234 10 days ago

What do you mean by doing the math? If you repeatedly sample n_active experts out of n_total, why wouldn't you expect to get some meaningful probability of reuse/overlap once your batch grows past size 5 or so (for the sparsest MoE models in common use)? And you only need enough reuse to fill the compute headroom which is quite small on consumer platforms (we won't have huge TOPS numbers for the typical integrated GPU in Strix Halo or even the upcoming RTX Spark). Plus if you're a single user running multiple streams in parallel the choice of experts will be highly biased leading to more reuse.
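For readers who want to put a rough number on the reuse argument above, here is a minimal sketch. It assumes every token in the batch picks its `n_active` experts uniformly and independently, which is the pessimistic case compared with the biased routing the comment describes. The function names and the 128/8 expert counts are illustrative assumptions, not figures from the thread.

```python
def expected_distinct_experts(n_total: int, n_active: int, batch: int) -> float:
    """Expected number of distinct experts one layer touches for a batch.

    Assumption: each token independently picks n_active of n_total experts
    uniformly at random, so a given expert is missed by one token with
    probability (1 - n_active / n_total).
    """
    p_missed_by_all = (1 - n_active / n_total) ** batch
    return n_total * (1 - p_missed_by_all)


def reuse_fraction(n_total: int, n_active: int, batch: int) -> float:
    """Share of expert loads saved by sharing experts across the batch."""
    loads_without_sharing = n_active * batch
    distinct = expected_distinct_experts(n_total, n_active, batch)
    return 1 - distinct / loads_without_sharing


if __name__ == "__main__":
    # Hypothetical sparse MoE layer: 128 experts, 8 active per token.
    for batch in (1, 2, 4, 8, 16, 32):
        distinct = expected_distinct_experts(128, 8, batch)
        saved = reuse_fraction(128, 8, batch)
        print(f"batch={batch:2d}  distinct={distinct:6.1f}  reuse={saved:.0%}")
```

Under the uniform assumption reuse grows slowly with batch size; correlated routing within one user's parallel streams would push it higher, but by how much is an empirical question this sketch does not answer.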


lostmsu 10 days ago

Yeah, that's what talking out of your butt is literally. "theoretical", no ballparks, ignorant assumptions about expert reuse.


zozbot234 9 days ago

There have been some very rough experiments with batching on Apple Silicon (not a highly suitable platform, since the compute/thermal bottleneck hits sooner than elsewhere) that seem broadly consistent with what I argued, showing as much as 2x total decode throughput with an 8-wide batch. That's substantial in this context.


lostmsu 9 days ago

Assuming you magically use all 128GiB of xRAM you need to read ~32GiB per token in batched mode. On a good SSD that would be 1/3 tokens per second. Cool, 2x that you can do 2/3 tokens per second. Let's assume you are lucky and can actually do 6/7 tokens per second. That's still an extremely far cry from 20+ tokens per second of 27B before any batching.
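The arithmetic in this comment can be sketched as a bandwidth-bound estimate. The comment gives ~32GiB read per token and 1/3 tokens per second; the SSD read rate below (about 10.7 GiB/s) is inferred backward from those two figures and is an assumption, as are the function and parameter names.

```python
def decode_tokens_per_second(read_gib_per_step: float,
                             ssd_gib_per_s: float,
                             tokens_per_step: float = 1.0) -> float:
    """Upper bound on decode rate when every forward pass is limited by
    streaming expert weights from SSD (compute time is ignored)."""
    seconds_per_step = read_gib_per_step / ssd_gib_per_s
    return tokens_per_step / seconds_per_step


if __name__ == "__main__":
    read_gib = 32.0          # from the comment: ~32GiB read per token
    ssd_rate = 32.0 / 3.0    # assumed SSD rate implied by "1/3 tokens per second"

    base = decode_tokens_per_second(read_gib, ssd_rate)
    # The 2x batching gain cited earlier in the thread, applied as extra
    # tokens produced per unit of data read.
    batched = decode_tokens_per_second(read_gib, ssd_rate, tokens_per_step=2.0)
    print(f"unbatched: {base:.2f} tok/s, with 2x batching gain: {batched:.2f} tok/s")
```

Plugging in a different SSD rate or a smaller per-step read changes the result proportionally, so the conclusion depends directly on how much of the model really has to be streamed per pass.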


fc417fc802 9 days ago

I don't understand where your numbers are coming from. Why is there a 20x (40x?) slowdown of tps after batching?


lostmsu 9 days ago

Before batching. The slowdown is because the model does not fit in the xRAM, so experts have to be read from SSD on every forward pass. That's why it is impractically slow.

Batching could allow you to generate 10 tokens for 10 different conversations at a time, but it also means that you need to load different experts for different tokens, so it does not help as much as it does for dense models.
