| HN Mirror



	by lostmsu 53 days ago
	Before batching. The slowdown is because the model does not fit in xRAM, so experts have to be read from SSD on every forward pass. That's why it is impractically slow. Batching could allow you to generate 10 tokens for 10 different conversations at a time, but it also means that you need to load different experts for different tokens, so it does not help as much as it does for dense models.

1 comments

fc417fc802 53 days ago

But IIUC the point is that each expert gets used for more than just the one token. So yes, the tps of a given thread takes a hit because now you're sometimes going to schedule in unrelated experts and it will have to pause. But overall you're utilizing the hardware much more efficiently and so in aggregate there's a speedup.
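To put rough numbers on the amortization argument, here is a minimal sketch. It assumes every token independently picks `top_k` distinct experts uniformly at random, which is a simplification and not a measurement of any real model's router. The function names and the 256/8 configuration are made up for illustration.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch: int) -> float:
    """Expected number of distinct experts touched in one MoE layer by `batch` tokens.

    Assumption: each token independently selects `top_k` distinct experts
    uniformly at random, so a given expert is missed by one token with
    probability 1 - top_k / num_experts.
    """
    p_missed_by_one_token = 1.0 - top_k / num_experts
    return num_experts * (1.0 - p_missed_by_one_token ** batch)


def expert_loads_per_token(num_experts: int, top_k: int, batch: int) -> float:
    """Expert loads amortized over the tokens generated in the same batch."""
    return expected_distinct_experts(num_experts, top_k, batch) / batch


if __name__ == "__main__":
    # Hypothetical configuration: 256 experts per layer, 8 active per token.
    for batch in (1, 10, 100):
        print(batch, round(expert_loads_per_token(256, 8, batch), 2))
```

Under uniform routing the loads per token fall slowly at small batch sizes, since most tokens still pull in experts nobody else in the batch needed. Any bias in routing would make the reuse better than this estimate.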

On top of that (as previously pointed out by zoz) for a single user running a single overarching task the choice of experts is expected to be highly biased.


lostmsu 53 days ago

> the choice of experts is expected to be highly biased

Why? Why do you think that's the case? Part of the training is balancing load between experts.

> so in aggregate there's a speedup.

Yes. 2x. Over a theoretical baseline of under 1 tok/s.


fc417fc802 53 days ago

> Why do you think that's the case? Part of the training is balancing load between experts.

That is a fair point. That expectation may have been misplaced on my part. I'm not sufficiently familiar with the details of MoE training.

> The slowdown is because the model does not fit the xRAM so experts will have to be read from SSD on every forward pass.

> 20+ tokens per second of 27B before any batching.

Does the model fit in RAM or not? What is your justification for your stated expectation that the unbatched model will perform 20x faster than the aggregate tps (note, not the single stream tps) of the batched model?

My expectation is that if the unbatched model is 20 tps and batching provides a 2x speedup then each individual stream will be slower but the aggregate throughput should rise to 40 tps. What do you believe me to be missing here?
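Spelled out as arithmetic, using the 20 tps and 2x figures from this thread. The stream count of 10 is only an illustrative assumption, and the helper names are hypothetical:

```python
def aggregate_tps(unbatched_tps: float, batching_speedup: float) -> float:
    """Total tokens per second across all streams after batching."""
    return unbatched_tps * batching_speedup


def per_stream_tps(aggregate: float, num_streams: int) -> float:
    """Tokens per second seen by a single stream, assuming an even split."""
    return aggregate / num_streams


if __name__ == "__main__":
    total = aggregate_tps(unbatched_tps=20.0, batching_speedup=2.0)
    # Assumed: 10 concurrent streams sharing the batch.
    print(total, per_stream_tps(total, num_streams=10))
```

Each stream is slower than the unbatched 20 tps, but the aggregate is higher. The same arithmetic applied to a baseline of under 1 tok/s gives an aggregate of under 2 tok/s.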


lostmsu 52 days ago

27B does. The OP was talking about consumer use of DS v4 Flash, which is 160GB.


zozbot234 51 days ago

It has been quantized to 80GB (2-bit quantization for experts) with limited degradation. Certainly competitive with a 27B model, and especially useful in a size range where few "native" models exist.


lostmsu 51 days ago

> (2-bit quantization for experts) with limited degradation. Certainly competitive with a 27B model

Uh-huh...


zozbot234 52 days ago

> Why? Why do you think that's the case? Part of the training is balancing load between experts.

The training balances expert choice across the entire scope of the model. Experiments have consistently shown that within a given session or topic (taken in a broad sense) expert choice is biased in a way that's likely to make caching useful and reuse across a user-specific batch realistic.
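A toy simulation of why bias matters for caching, as a sketch only. The Zipf-like weights stand in for "biased within a session" and are an assumption, not measured routing data. The expert count, cache size, and `lru_hit_rate` helper are likewise hypothetical.

```python
import random
from collections import OrderedDict


def lru_hit_rate(weights, top_k, cache_size, steps, seed=0):
    """Fraction of expert lookups served from an LRU cache of resident experts.

    Each step models one token choosing `top_k` distinct experts, sampled
    with probability proportional to `weights`.
    """
    rng = random.Random(seed)
    experts = list(range(len(weights)))
    cache = OrderedDict()  # expert id -> True, ordered oldest to newest use
    hits = 0
    total = 0
    for _ in range(steps):
        chosen = set()
        while len(chosen) < top_k:
            chosen.add(rng.choices(experts, weights=weights)[0])
        for expert in sorted(chosen):
            total += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)  # mark as most recently used
            else:
                cache[expert] = True  # "load from SSD"
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total


if __name__ == "__main__":
    num_experts = 64
    uniform = [1.0] * num_experts
    biased = [1.0 / (rank + 1) for rank in range(num_experts)]  # assumed Zipf-like skew
    print("uniform:", lru_hit_rate(uniform, top_k=4, cache_size=16, steps=2000))
    print("biased: ", lru_hit_rate(biased, top_k=4, cache_size=16, steps=2000))
```

With perfectly balanced routing the hit rate stays close to the fraction of experts that fit in the cache. With skewed routing the same cache serves more of the lookups, which is the effect being claimed for a single session or topic.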
