by zozbot234
24 days ago
That's really only "useless" if all you care about is a quick real-time response. Contrary to common perception, MoE models do benefit from batching requests together even when run on a single node; you just need at least ~5 parallel requests in flight (and that's for the very sparsest models) to really see the aggregate benefit. Intuitively, whether any active weights are shared among requests (and so whether any memory throughput gets reused) is a generalized birthday problem, which is why even a few parallel requests are quite effective, especially since the "random" choice of experts happens anew at every layer, so there are a lot of independent samples.
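To put rough numbers on the birthday-problem intuition, here is a minimal sketch. It assumes routing is uniform and independent across requests and layers, which real routers are not, so treat the output as a ballpark only. The function names and the example sizes (128 experts, top-8) are hypothetical and not taken from any particular model.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch: int) -> float:
    """Expected number of distinct experts touched in one layer by `batch` tokens.

    Assumption: each token picks `top_k` of `num_experts` uniformly at random,
    independently of the other tokens. A given expert is then skipped by one
    token with probability 1 - top_k / num_experts.
    """
    p_untouched = (1 - top_k / num_experts) ** batch
    return num_experts * (1 - p_untouched)


def relative_weight_reads(num_experts: int, top_k: int, batch: int) -> float:
    """Expert weight reads per token, relative to running each request alone.

    1.0 means no sharing at all; lower means more memory traffic is reused.
    """
    distinct = expected_distinct_experts(num_experts, top_k, batch)
    return distinct / (batch * top_k)


if __name__ == "__main__":
    # Hypothetical sparse config: 128 experts, 8 active per token.
    for batch in (1, 2, 5, 16, 64):
        ratio = relative_weight_reads(128, 8, batch)
        print(f"batch={batch:3d}  relative expert reads per token = {ratio:.3f}")
```

Under these assumptions the per-token saving grows smoothly with batch size, so how much "a few requests" buys depends heavily on the ratio of active to total experts.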
|
For prefill, it's really easy to batch MoE and get really good tok/s, even on a single stream.
For decode, you run into two problems:
1) you need more parallel requests, which means more memory for context
2) 5 parallel requests will not give you very much expert overlap
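
For point 1, a back-of-the-envelope sketch of how context memory scales with the number of parallel requests. This assumes a plain K/V cache with grouped-query attention and no compression, sliding window, or paging tricks; the model dimensions below are made up for illustration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for `batch` requests.

    The leading factor of 2 covers one K and one V tensor per layer.
    bytes_per_elem=2 corresponds to fp16/bf16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model: 48 layers, 8 KV heads of dim 128, 32k context.
    for batch in (1, 5, 16):
        gib = kv_cache_bytes(48, 8, 128, 32768, batch) / 2**30
        print(f"batch={batch:2d}  KV cache = {gib:.0f} GiB")
```

The cost is linear in the number of requests, so every extra stream you add to improve expert overlap has to be paid for in cache memory.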