Hacker News new | ask | show | jobs
by gaogao 523 days ago
This approach likely scales to top models. There was a thought for a while among researchers that scaling up Hogwild / LocalSGD to more parameters would be less convergent, but it seems to be that things actually converge a bit better with larger models, especially if a MoE split is used.
1 comments

The intuition is that smaller models are figuring out things like grammar where the whole model comes into play, but larger models, especially in the back half of training, have localized knowledge updates that can merge easier in the AllReduce