by fc417fc802
6 days ago
But IIUC the point is that each expert gets used for more than just the one token. So yes, the tps of a given thread takes a hit because now you're sometimes going to schedule in unrelated experts and it will have to pause. But overall you're utilizing the hardware much more efficiently and so in aggregate there's a speedup. On top of that (as previously pointed out by zoz) for a single user running a single overarching task the choice of experts is expected to be highly biased.
Why? Why do you think that's the case? Part of the training is balancing load between experts.
> so in aggregate there's a speedup.
Yes. 2x, over a theoretical baseline of under 1 tok/s.
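For what it's worth, the amortization argument in the quoted comment can be put into rough numbers. This is only a back-of-the-envelope sketch, not a model of any real runtime: it assumes every token independently picks `top_k` distinct experts uniformly at random (i.e. perfectly balanced routing, the case raised in the reply), and the figures `num_experts=64` and `top_k=8` are hypothetical values chosen for illustration.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts touched in one MoE layer when
    `batch_tokens` tokens each pick `top_k` of `num_experts` uniformly at random.
    (Uniform routing is an assumption, not a measured property of any model.)"""
    # Probability that one particular expert is NOT chosen by a single token.
    p_miss_one_token = 1.0 - top_k / num_experts
    # Tokens are assumed independent, so the expert is missed by all of them
    # with probability p_miss_one_token ** batch_tokens.
    return num_experts * (1.0 - p_miss_one_token ** batch_tokens)


def expert_uses_per_load(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Average number of token computations served by each expert that has
    to be brought in for the step. 1.0 means no reuse at all."""
    loaded = expected_distinct_experts(num_experts, top_k, batch_tokens)
    return batch_tokens * top_k / loaded


if __name__ == "__main__":
    # Hypothetical configuration: 64 experts, 8 active per token.
    for batch in (1, 4, 16, 64):
        loaded = expected_distinct_experts(64, 8, batch)
        reuse = expert_uses_per_load(64, 8, batch)
        print(f"batch={batch:3d}  experts loaded={loaded:6.2f}  uses per load={reuse:5.2f}")
```

Under these assumptions, a larger batch touches more experts per step (which is why any one thread stalls more often) but gets more work out of each expert it loads (the aggregate gain). If routing is biased rather than uniform, fewer distinct experts are touched than this estimate suggests, so reuse would be higher.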