by erikaww 808 days ago
Mixing these would be pretty cool. It could further reduce compute for MoE while keeping the performance.
1 comment

The paper already shows a mix of these two: Mixture-of-Depths-and-Experts (MoDE).