by rileyphone
806 days ago
With an MoE you only need to train a smaller dense model, which you can then copy into 8 experts and finetune while training the router. Mistral reportedly used their 7B base to make Mixtral, Qwen's new MoE is upcycled from their 1.8B model (2.7B active parameters), and I'm pretty sure Grok also trained a smaller model first.
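To make the "combine into an x8" step concrete, here is a rough sketch of that kind of upcycling on a toy feed-forward block. This is not any lab's actual recipe: the function names (`dense_ffn`, `upcycle`, `moe_ffn`), the zero-initialized router, and the top-2 routing are all assumptions made for illustration, in plain Python with no ML library.

```python
import copy
import math


def dense_ffn(x, w_in, w_out):
    """Toy dense feed-forward block: ReLU(w_in @ x), then w_out @ hidden.

    w_in holds one row of weights per hidden unit; w_out holds one row per
    output dimension.
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]


def upcycle(w_in, w_out, num_experts=8):
    """Copy the dense weights into every expert and add a fresh router.

    The zero-initialized router is an assumption of this sketch; it gives
    every expert the same logit until the router is trained.
    """
    experts = [(copy.deepcopy(w_in), copy.deepcopy(w_out)) for _ in range(num_experts)]
    d_model = len(w_in[0])
    router = [[0.0] * d_model for _ in range(num_experts)]
    return experts, router


def moe_ffn(x, experts, router, top_k=2):
    """Route x to the top_k experts and mix their outputs with softmax gates."""
    logits = [sum(xi * w for xi, w in zip(x, row)) for row in router]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)
    gates = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(gates)
    out = [0.0] * len(experts[0][1])
    for i, gate in zip(chosen, gates):
        expert_out = dense_ffn(x, *experts[i])
        out = [o + (gate / total) * y for o, y in zip(out, expert_out)]
    return out
```

Because every expert starts as an exact copy and the gates sum to 1, the upcycled block reproduces the dense block's output before any further training. The experts only diverge once you finetune them together with the router.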