by fc417fc802 6 days ago
> Why do you think that's the case? Part of the training is balancing load between experts.

That is a fair point. That expectation may have been misplaced on my part. I'm not sufficiently familiar with the details of MoE training.

> The slowdown is because the model does not fit the xRAM so experts will have to be read from SSD on every forward pass.

> 20+ tokens per second of 27B before any batching.

Does the model fit in RAM or not? What is your justification for your stated expectation that the unbatched model will perform 20x faster than the aggregate tps (note, not the single stream tps) of the batched model?

My expectation is that if the unbatched model is 20 tps and batching provides a 2x speedup then each individual stream will be slower but the aggregate throughput should rise to 40 tps. What do you believe me to be missing here?
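To put numbers on that expectation, here is a minimal sketch. Only the 20 tps and 2x figures come from the discussion above; the batch size of 4 and the function name are assumptions made up for illustration, and the model ignores real-world effects like SSD reads or expert routing.

```python
def batched_throughput(single_stream_tps: float,
                       batch_speedup: float,
                       batch_size: int) -> tuple[float, float]:
    """Return (aggregate_tps, per_stream_tps) under a simple scaling model.

    Assumes the whole benefit of batching shows up as a multiplier on
    aggregate throughput, shared evenly across the streams in the batch.
    """
    aggregate_tps = single_stream_tps * batch_speedup
    per_stream_tps = aggregate_tps / batch_size
    return aggregate_tps, per_stream_tps


# 20 tps unbatched and a 2x batching speedup are from the thread;
# batch_size=4 is a hypothetical value chosen for the example.
aggregate, per_stream = batched_throughput(20.0, 2.0, 4)
print(f"aggregate: {aggregate} tps, per stream: {per_stream} tps")
```

With these assumed inputs each stream is slower than the unbatched 20 tps, while the total across streams is higher, which is the distinction between single-stream and aggregate tps being asked about.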


27B does. The OP was talking about consumer use of DS v4 Flash, which is 160GB.
It has been quantized to 80GB (2-bit quantization for experts) with limited degradation. Certainly competitive with a 27B model, and especially useful in a size range where few "native" models exist.
> (2-bit quantization for experts) with limited degradation. Certainly competitive with a 27B model

Uh-huh...