by phamilton 3 days ago
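MTP (multi-token prediction) on a MoE is hit or miss. If you're bottlenecked on memory, MTP can increase the number of active experts (like any batch processing would), because each extra token in a step can route to different experts whose weights then have to be loaded. That can eat away at the gains from it.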
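
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each token independently picks `top_k` of `num_experts` experts uniformly at random in a given layer, which is a simplification and not how any particular router behaves. The function name and the 64-expert, top-4 configuration are made up for illustration.

```python
def expected_active_experts(num_experts: int, top_k: int, tokens: int) -> float:
    """Expected number of distinct experts touched in one MoE layer.

    Assumption (illustrative only): every token selects `top_k` experts
    uniformly at random, independently of the other tokens in the step.
    """
    # Probability that one given expert is picked by none of the tokens.
    p_untouched = (1.0 - top_k / num_experts) ** tokens
    return num_experts * (1.0 - p_untouched)


if __name__ == "__main__":
    # Hypothetical config: 64 experts, top-4 routing.
    for tokens in (1, 2, 4, 8):
        active = expected_active_experts(num_experts=64, top_k=4, tokens=tokens)
        print(f"{tokens} tokens per step: ~{active:.2f} of 64 experts read")
```

Under these assumptions, going from 1 to 2 tokens per step nearly doubles the expert weights read per layer, so the extra tokens are far from free when weight loading is the bottleneck. Real routers are not uniform, and neighboring tokens may share experts more or less often than this model assumes, so treat it as a rough estimate only.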