by Manabu-eo
779 days ago
Google's older Switch-C transformer [1] had 2048 experts and 1.6T parameters, with only one expert activated in each layer, so it was much more sparse. But it was also severely undertrained, like the other models of that era, and is thus useless now.

[1] https://huggingface.co/google/switch-c-2048