by ioulaum
518 days ago
It's not actually a 600B+ model in the dense sense. It's a mixture of experts. The individual experts are pretty small, so only a fraction of the parameters is used per token, and thus it doesn't require as much training compute to reach a decent point. It's similar to how Mixtral got good performance without anywhere near OpenAI-class money / compute.
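To make the total-vs-active distinction concrete, here is a rough sketch of the arithmetic. All numbers and names below (`moe_param_counts`, the expert count, the per-expert size, the shared parameter count) are made up for illustration and are not the real figures for this model or for Mixtral.

```python
def moe_param_counts(n_experts, experts_per_token, params_per_expert, shared_params):
    """Return (total, active-per-token) parameter counts for a toy MoE model.

    shared_params covers everything every token passes through
    (attention, embeddings, router); only the expert weights are sparse.
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# Hypothetical configuration, chosen only to show the gap.
total, active = moe_param_counts(
    n_experts=8,
    experts_per_token=2,
    params_per_expert=5_000_000_000,
    shared_params=2_000_000_000,
)
print(f"{total / 1e9:.0f}B total, {active / 1e9:.0f}B active per token")
```

The headline size is the total, but the compute spent on each token scales roughly with the active count, which is the point being made above.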
Is this described in the paper, or was it inferred from the model itself?
Just curious, especially if it's the latter.