Hacker News
by marci 5 hours ago

  "That’s where EMO comes in.

  We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."
https://allenai.org/blog/emo
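
To put numbers on the quote: 12.5% of 128 experts is 16 experts, and the router still activates 8 per token. Below is a rough sketch of what "selective expert use" could look like at routing time. It is not EMO's actual code. The `route` and `subset_size` helpers, the softmax renormalization over the chosen experts, and the way the allowed subset is supplied are all assumptions for illustration; the blog post is the place to check how the subset is really picked.

```python
import math

NUM_EXPERTS = 128        # total experts (from the quote)
TOP_K = 8                # experts active per token (from the quote)
SUBSET_FRACTION = 0.125  # share of experts kept for a task (from the quote)


def subset_size(num_experts=NUM_EXPERTS, fraction=SUBSET_FRACTION):
    """Number of experts kept when only `fraction` of them is used."""
    return int(num_experts * fraction)


def route(logits, allowed, top_k=TOP_K):
    """Hypothetical router: pick the top_k experts among `allowed` only.

    logits  -- one router score per expert (length NUM_EXPERTS)
    allowed -- iterable of expert ids kept for this task or domain
    Returns {expert_id: weight}, with weights renormalized by a softmax
    over the selected experts (an assumption, not confirmed by the source).
    """
    # Experts outside `allowed` are never considered, however high they score.
    chosen = sorted(allowed, key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)  # subtract the max for stability
    exps = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


if __name__ == "__main__":
    print(subset_size())  # experts kept out of 128
    scores = [float(i) for i in range(NUM_EXPERTS)]
    print(route(scores, allowed=range(subset_size())))
```

With a 16-expert subset and top-8 routing, each token still runs through 8 experts, so per-token compute stays about the same. The saving would come from not having to keep the other 112 experts loaded.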