by ajtejankar
911 days ago
Fixing the experts for a layer might not work, since all experts fire with almost equal probability. There are small variations by topic, but they are consistent enough to be captured with a simple linear classifier. I believe this happens because of the load-balancing loss, which forces the model to pick all experts with equal probability. However, what you're saying is a great direction for future MoEs. Can we train MoEs without load balancing so that it is possible to quantize/prune the non-relevant experts more aggressively? We haven't had any major open-source MoEs because, as far as I know, they are not straightforward to train, but I expect this to change.

Edit: Oh, I read it completely backwards, didn't I? Given the routing decisions, you can determine the topic of the input.
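
For anyone wondering why a load-balancing loss would flatten expert usage, here is a minimal sketch. It assumes the Switch Transformer style auxiliary loss with top-1 routing, N * sum_i(f_i * P_i); I don't know the exact form used by the model discussed here. The function names and toy logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits):
    """Auxiliary loss N * sum_i(f_i * P_i), assuming top-1 routing.

    router_logits: one list of per-expert logits for each token.
    f_i is the fraction of tokens routed to expert i;
    P_i is the mean router probability assigned to expert i.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose highest-probability expert is i.
    f = [0.0] * num_experts
    for p in probs:
        f[p.index(max(p))] += 1.0 / num_tokens

    # P[i]: router probability for expert i, averaged over tokens.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))


# Toy inputs (hypothetical): 4 tokens, 4 experts.
# Balanced: each token prefers a different expert.
balanced_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
# Collapsed: every token prefers expert 0.
collapsed_logits = [[2.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced = load_balancing_loss(balanced_logits)
collapsed = load_balancing_loss(collapsed_logits)
```

With perfectly even routing the loss sits at its minimum of 1.0, and it grows as routing concentrates on a few experts. That is the pressure toward near-uniform expert usage I was describing, and also why dropping it might let experts specialize enough to prune.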