by Manabu-eo 779 days ago
Google's old Switch-C transformer [1] had 2048 experts and 1.6T parameters, with only one expert activated per layer, so it was much sparser. But like the other models of that era it was severely undertrained, and is thus useless now.

1. https://huggingface.co/google/switch-c-2048
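
To make "only one activated for each layer" concrete, here is a rough toy sketch of top-1 routing in plain Python. It is not the Switch-C code: the router weights, the `switch_layer` and `top1_route` names, the tiny dimension, and the scale-by-a-constant "experts" are all made up for illustration, and it skips details such as gate scaling and load balancing.

```python
import random

NUM_EXPERTS = 2048  # expert count from the comment above
DIM = 4             # toy hidden size, an assumption for illustration


def top1_route(router_logits):
    """Return the index of the single expert with the highest router score."""
    return max(range(len(router_logits)), key=lambda i: router_logits[i])


def switch_layer(token, router_weights, experts):
    """Send one token through exactly one expert (hypothetical toy layer)."""
    # One router score per expert: dot product of its weight row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = top1_route(logits)
    # Only the chosen expert runs; every other expert's parameters stay idle.
    return chosen, experts[chosen](token)


rng = random.Random(0)
router_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# Toy experts: expert i just scales the token by i.
experts = [lambda t, s=i: [s * x for x in t] for i in range(NUM_EXPERTS)]

token = [rng.gauss(0, 1) for _ in range(DIM)]
chosen, out = switch_layer(token, router_weights, experts)

# Fraction of the expert pool touched by this token in this layer.
active_fraction = 1 / NUM_EXPERTS
```

The point is just that each token touches 1 of 2048 experts in a layer, which is why the total parameter count can be huge while the compute per token stays small.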