| HN Mirror



	by gaogao 523 days ago
	This approach likely scales to top models. For a while, researchers thought that scaling Hogwild / LocalSGD up to more parameters would hurt convergence, but it seems that things actually converge a bit better with larger models, especially if a MoE split is used.
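For readers who haven't seen LocalSGD, here is a rough toy sketch of the idea: each worker takes several SGD steps on its own data shard, then the replicas are averaged, which stands in for the AllReduce. Everything here (the `local_sgd` name, the linear-regression task, the hyperparameters) is an assumption made up for illustration, not anything from a real training setup, and it says nothing about how this behaves at scale.

```python
import random


def local_sgd(num_workers=4, rounds=50, local_steps=8, lr=0.2, seed=0):
    """Toy LocalSGD: fit y = w*x + b, averaging replicas every `local_steps` steps."""
    rng = random.Random(seed)
    true_w, true_b = 3.0, -1.0  # hypothetical target, chosen for the demo

    # Each worker gets its own shard of slightly noisy samples.
    shards = []
    for _ in range(num_workers):
        shard = []
        for _ in range(64):
            x = rng.uniform(-1.0, 1.0)
            y = true_w * x + true_b + rng.gauss(0.0, 0.01)
            shard.append((x, y))
        shards.append(shard)

    w, b = 0.0, 0.0  # shared starting point for all replicas
    for _ in range(rounds):
        replicas = []
        for shard in shards:
            lw, lb = w, b  # every worker starts the round from the merged model
            for _ in range(local_steps):
                x, y = rng.choice(shard)
                err = (lw * x + lb) - y
                lw -= lr * err * x  # gradient of 0.5 * err^2 w.r.t. w
                lb -= lr * err      # gradient of 0.5 * err^2 w.r.t. b
            replicas.append((lw, lb))
        # Simulated AllReduce: replace the shared model with the replica mean.
        w = sum(r[0] for r in replicas) / num_workers
        b = sum(r[1] for r in replicas) / num_workers
    return w, b


if __name__ == "__main__":
    print(local_sgd())
```

Raising `local_steps` means fewer synchronizations per sample seen, which is the communication saving that makes this family of methods attractive.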

1 comment

gaogao 523 days ago

The intuition is that smaller models are still figuring out things like grammar, where the whole model comes into play, while larger models, especially in the back half of training, make localized knowledge updates that merge more easily in the AllReduce.
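One way to make "merge more easily" concrete, as a toy only: compare how much of the workers' updates cancel when summed. The `merge_interference` metric below is my own illustrative assumption, not an established measure, and the update vectors are invented. Updates that touch disjoint parameters are orthogonal and score 0, while dense updates that disagree in sign partly cancel and score above 0.

```python
def merge_interference(updates):
    """Return 1 - ||sum of updates||^2 / sum of ||update||^2.

    0 means the updates are mutually orthogonal (nothing cancels),
    positive means they partly cancel, negative means they reinforce.
    """
    dim = len(updates[0])
    summed = [sum(u[i] for u in updates) for i in range(dim)]
    norm_of_sum = sum(v * v for v in summed)
    sum_of_norms = sum(v * v for u in updates for v in u)
    if sum_of_norms == 0.0:
        return 0.0  # no updates at all, so nothing to interfere
    return 1.0 - norm_of_sum / sum_of_norms


# Hypothetical "localized" updates: each worker touches its own coordinates.
localized = [
    [1.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, -0.7, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, -0.9],
]

# Hypothetical "dense" updates: workers disagree on shared coordinates.
dense = [
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, 1.0],
]

if __name__ == "__main__":
    print(merge_interference(localized))
    print(merge_interference(dense))
```

This only shows the arithmetic of averaging; whether real large-model updates are actually this localized is the empirical claim above, not something the snippet demonstrates.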
