by airgapstopgap
915 days ago
> It's not even close to a 45B model. They trained 8 different fine-tunes on the same base model. This means the 8 models differ only by a couple of layers and share the rest of their layers.

No, Mixture-of-Experts is not stacking finetunes of the same base model.

It made sense to me at first sight, because you don't need to train stuff like syntax and grammar 8 times in 8 different ways.
It would also explain why inference with two 7B models has the cost of running a 12B model.
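
For what it's worth, the 45B and 12B figures can be sanity-checked with a back-of-the-envelope count. This is only a rough sketch: every dimension below is an assumption (roughly what I recall of a Mixtral-8x7B-style config, none of it stated in this thread), `count_params` is a made-up helper, and norms and biases are ignored.

```python
# Rough parameter count for a sparse MoE transformer (norms and biases ignored).
# All dimensions are assumptions for illustration, not taken from the thread.
DIM = 4096          # model width
N_LAYERS = 32
FFN_HIDDEN = 14336  # hidden width of each expert's feed-forward block
N_EXPERTS = 8       # experts stored per layer
TOP_K = 2           # experts actually run per token
VOCAB = 32000
KV_DIM = 1024       # grouped-query attention: narrower K and V projections


def count_params(experts_per_layer: int) -> int:
    """Count weights when `experts_per_layer` experts are included in each layer."""
    attention = 2 * DIM * DIM + 2 * DIM * KV_DIM  # Q and O, plus K and V
    expert = 3 * DIM * FFN_HIDDEN                 # gate, up and down projections
    router = DIM * N_EXPERTS                      # linear layer that scores experts
    per_layer = attention + router + experts_per_layer * expert
    embeddings = 2 * VOCAB * DIM                  # input embedding + output head
    return N_LAYERS * per_layer + embeddings


total = count_params(N_EXPERTS)  # everything that must sit in memory
active = count_params(TOP_K)     # what a single token actually passes through

print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the total lands in the mid-40s of billions and the per-token active count just under 13B. Only the feed-forward experts are replicated, while attention and embeddings are shared, so running two experts costs less than 2 x 7B.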