by famouswaffles
1062 days ago
Sparse MoE models are neither new nor secret. The only reason you haven't seen much use of them for LLMs is that they typically underperformed their dense counterparts by a wide margin. Until this paper (https://arxiv.org/abs/2305.14705) indicated that they apparently benefit far more from instruction tuning than dense models do, it was mostly a "good on paper" kind of thing. In the paper, you can see the underperformance I'm talking about. Flan-MoE-32B (259B total) scores 25.5% on MMLU before instruction tuning and 65.4% after. Flan 62B scores 55% before instruction tuning and 59% after. That's a gain of roughly 40 points for the MoE model versus 4 for the dense one.