| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by nschucher 2203 days ago
	They trained a >600B parameter translation model (GPT-3 is 175B parameters). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (https://arxiv.org/abs/2006.16668)