by ajtejankar 911 days ago
Fixing the experts for a layer might not work, since all experts fire with almost equal probability. There are small variations by topic, but they are consistent enough to be captured by a simple linear classifier. I believe this happens because of the load-balancing loss, which pushes the model to pick all experts with equal probability. That said, what you're suggesting is a great direction for future MoEs: can we train MoEs without load balancing, so that the non-relevant experts can be quantized or pruned more aggressively? We haven't had any major open-source MoEs because, as far as I know, they are not straightforward to train, but I expect this to change.
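
For anyone who hasn't seen a load-balancing loss written out, here is a rough sketch of the Switch Transformer style auxiliary loss with top-1 routing. This is an assumption for illustration, not necessarily the exact loss used by the model in the article, and the function names are hypothetical. The loss is the number of experts times the sum over experts of (fraction of tokens routed to the expert) times (mean router probability for the expert), which is smallest when routing is uniform.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits):
    """router_logits: one list of per-expert logits for each token."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Four tokens, four experts: each token prefers a different expert.
balanced = load_balancing_loss(
    [[10.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
)
# Every token prefers expert 0.
collapsed = load_balancing_loss([[10.0, 0.0, 0.0, 0.0] for _ in range(4)])
```

With these toy logits, `balanced` comes out at about 1.0 (the minimum) and `collapsed` is close to 4, so minimizing this term pushes the router toward using every expert equally often.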
1 comment

I could be reading this wrong, but doesn't the article say that the MMLU topic is enough to determine which experts got picked 96% of the time?

Edit: Oh, I read it completely backwards, didn't I? Given the routing decisions, you can determine the topic of the input.
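
To make that direction concrete, here is a toy sketch of predicting the topic from expert decisions with a linear classifier. Everything in it is an assumption for illustration: the features are per-input histograms of how often each of four experts was picked, the numbers are made up (not from the article), and a multiclass perceptron stands in for whatever linear classifier was actually used.

```python
def predict(weights, x):
    # Linear scores, one per topic; pick the highest-scoring topic.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return scores.index(max(scores))

def train_perceptron(samples, labels, num_classes, max_epochs=1000):
    weights = [[0.0] * len(samples[0]) for _ in range(num_classes)]
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            guess = predict(weights, x)
            if guess != y:
                mistakes += 1
                # Move the true topic toward x and the wrong guess away from it.
                for i, xi in enumerate(x):
                    weights[y][i] += xi
                    weights[guess][i] -= xi
        if mistakes == 0:
            break  # every training sample is classified correctly
    return weights

# Made-up expert-usage histograms: nearly uniform, with a small topic-dependent tilt.
samples = [
    [0.30, 0.25, 0.25, 0.20],  # topic 0
    [0.32, 0.24, 0.22, 0.22],  # topic 0
    [0.20, 0.25, 0.25, 0.30],  # topic 1
    [0.22, 0.23, 0.24, 0.31],  # topic 1
]
labels = [0, 0, 1, 1]

weights = train_perceptron(samples, labels, num_classes=2)
predictions = [predict(weights, x) for x in samples]
```

The point of the toy data is that the histograms can be close to uniform and still be linearly separable by topic, as long as the small tilt is consistent.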