by m3kw9 523 days ago
It’s talking about training 10B-parameter “capable models” with less compute using new techniques, but top models will always need more.
2 comments

This approach likely scales to top models. For a while, researchers thought that scaling Hogwild / LocalSGD up to more parameters would hurt convergence, but it seems that things actually converge a bit better with larger models, especially if a MoE split is used.
The intuition is that smaller models are still figuring out things like grammar, where the whole model comes into play, while larger models, especially in the back half of training, make localized knowledge updates that merge more easily in the AllReduce.
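
For anyone who hasn't seen LocalSGD, here is a rough toy sketch of the basic loop: each worker takes several gradient steps on its own data shard, then the weights are averaged, which is what the AllReduce does in a real cluster. Everything in it is an assumption for illustration: the function names `local_sgd` and `make_shards` are made up, the problem is a small least-squares fit, the workers run sequentially in one process, and the local steps use the full shard rather than minibatches. It says nothing about how this behaves at 10B parameters or with MoE.

```python
import numpy as np

def local_sgd(shards, dim, rounds=50, local_steps=8, lr=0.05):
    """Toy Local SGD on least squares (hypothetical helper, not a library API).

    shards: list of (X, y) pairs, one per simulated worker.
    """
    w = np.zeros(dim)  # shared weights, identical on every worker after a sync
    for _ in range(rounds):
        local_weights = []
        for X, y in shards:
            w_local = w.copy()  # each worker starts from the synced weights
            for _ in range(local_steps):
                # Gradient of mean squared error on this worker's shard only
                # (full-shard gradient here, for simplicity).
                grad = X.T @ (X @ w_local - y) / len(y)
                w_local -= lr * grad
            local_weights.append(w_local)
        # Stand-in for an AllReduce mean: average the workers' weights.
        w = np.mean(local_weights, axis=0)
    return w

def make_shards(num_workers=4, n_per_worker=200, dim=5, seed=0):
    """Build synthetic shards that share one true weight vector (assumption)."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)
    shards = []
    for _ in range(num_workers):
        X = rng.normal(size=(n_per_worker, dim))
        y = X @ w_true + 0.01 * rng.normal(size=n_per_worker)
        shards.append((X, y))
    return shards, w_true

if __name__ == "__main__":
    shards, w_true = make_shards()
    w = local_sgd(shards, dim=5)
    print("distance to true weights:", np.linalg.norm(w - w_true))
```

The point of the design is that communication happens once per round instead of once per step, so `local_steps` trades sync cost against how far the workers drift apart before they are merged.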
Wow, yeah, a 10B-parameter model is pretty tiny, and 300 3-GPU clusters for $18M is not really cheap.

I guess enormous is in the eye of the beholder.