by rughouse 808 days ago
It’s very similar to Mixture of Experts, but instead of routing tokens to multiple experts, you "deploy to a single expert which can be dynamically skipped".
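
To make the "single expert which can be dynamically skipped" idea concrete, here is a rough sketch in plain Python. It is not the paper's implementation: `mod_block`, `router_w`, `block`, and `capacity` are hypothetical names, and the top-k selection with router-score scaling is my reading of the general approach, assumed for illustration.

```python
from typing import Callable, List

Vector = List[float]


def mod_block(
    tokens: List[Vector],
    router_w: Vector,
    block: Callable[[Vector], Vector],
    capacity: int,
) -> List[Vector]:
    """Send only the `capacity` highest-scoring tokens through `block`.

    All other tokens skip the block and pass through unchanged.
    """
    # One scalar router score per token (dot product with an assumed weight vector).
    scores = [sum(x * w for x, w in zip(tok, router_w)) for tok in tokens]

    # Indices of the top-`capacity` tokens by router score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:capacity])

    out = []
    for i, tok in enumerate(tokens):
        if i in chosen:
            # Processed path: residual plus block output scaled by the router score.
            update = block(tok)
            out.append([x + scores[i] * u for x, u in zip(tok, update)])
        else:
            # Skipped path: the token is copied through untouched.
            out.append(list(tok))
    return out


# Toy usage: three 2-d tokens, room for only one of them in the block.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
result = mod_block(tokens, router_w=[1.0, 0.0], block=lambda t: [1.0 for _ in t], capacity=1)
```

Only the third token has a high enough router score to be processed here; the other two cost nothing beyond the routing itself, which is where the compute saving would come from.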
1 comment

Mixing these would be pretty cool. Further reduced compute for MoE while keeping the performance.
The paper already combines the two, as Mixture-of-Depths-and-Experts (MoDE).