| HN Mirror

	by Manabu-eo 826 days ago
	Google's old Switch-C transformer [1] had 2048 experts and 1.6T parameters, with only one expert activated per layer, so it was much sparser. But it was also severely undertrained, like the other models of that era, and is thus useless now.

	[1] https://huggingface.co/google/switch-c-2048
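
To make "only one activated for each layer" concrete, here is a rough sketch of top-1 routing in plain Python. It is not Switch-C's actual code: the function names and the random router scores are made up for illustration, and a real router would compute its scores from each token's hidden state with a learned layer. Note the fraction below counts experts only; non-expert weights such as attention run for every token.

```python
import random

NUM_EXPERTS = 2048  # expert count quoted in the comment above


def top1_route(router_logits):
    """Return the index of the single highest-scoring expert (top-1 routing)."""
    return max(range(len(router_logits)), key=lambda i: router_logits[i])


def active_expert_fraction(num_experts, experts_per_token=1):
    """Fraction of the experts in one layer that run for a given token."""
    return experts_per_token / num_experts


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical router scores for one token (assumption: random stand-ins
    # for what a learned router would produce).
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen = top1_route(logits)
    print(f"token routed to expert {chosen} of {NUM_EXPERTS}")
    print(f"active expert fraction per layer: {active_expert_fraction(NUM_EXPERTS):.5f}")
```

With one of 2048 experts chosen per token, each layer's expert block touches 1/2048 of its experts, which is the sparsity the comment is pointing at.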