by vessenes 47 days ago
Interesting little bit of history; this pre-Chinchilla paper proposed that training MoE models longer would improve performance. Good idea. They also proposed using a hash function to choose experts rather than training a routing layer, and showed it was marginally better at the time than existing routing techniques.
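
For anyone who hasn't seen hash routing, here is a minimal sketch of the idea, not the paper's actual code. It assumes routing depends only on the token id, via a fixed seeded lookup table; `build_hash_router` and `route` are names I made up for illustration.

```python
import random

def build_hash_router(vocab_size: int, num_experts: int, seed: int = 0) -> list[int]:
    """Build a fixed token-id -> expert-id table. Nothing here is learned."""
    rng = random.Random(seed)  # seeded, so the mapping is reproducible
    return [rng.randrange(num_experts) for _ in range(vocab_size)]

def route(table: list[int], token_ids: list[int]) -> list[int]:
    """Look up the expert for each token; the same token always gets the same expert."""
    return [table[t] for t in token_ids]

table = build_hash_router(vocab_size=1000, num_experts=8)
experts = route(table, [5, 17, 5, 999])
```

Since the table never changes during training, there is no router to train and nothing that can drift toward a few favorite experts.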

I’d guess that the hash function worked better because by definition it does not collapse onto a few experts. A modern MoE training run pays careful attention to expert usage and expects some experts to be ‘hotter’ than others: a totally flat usage distribution is a bad sign, and so are unused or radically underutilized experts.
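
A rough sketch of the kind of usage check I mean, assuming you have already logged which expert each token was sent to. The function names and the 10%-of-uniform threshold are arbitrary choices of mine, not a standard.

```python
from collections import Counter

def expert_utilization(assignments: list[int], num_experts: int) -> list[float]:
    """Fraction of routed tokens that went to each expert."""
    if not assignments:
        raise ValueError("no routing assignments logged")
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def flag_underused(fractions: list[float], ratio: float = 0.1) -> list[int]:
    """Experts whose share is below `ratio` times the uniform share (hypothetical threshold)."""
    uniform = 1.0 / len(fractions)
    return [e for e, f in enumerate(fractions) if f < ratio * uniform]

fractions = expert_utilization([0, 0, 1, 3], num_experts=4)
underused = flag_underused(fractions)  # expert 2 received no tokens
```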