	by m3kw9 523 days ago
	It’s talking about training 10B-parameter “capable models” with less compute using new techniques, but top models will always need more.

2 comments

gaogao 523 days ago

This approach likely scales to top models. For a while, researchers thought that scaling up Hogwild / LocalSGD to more parameters would hurt convergence, but it seems that things actually converge a bit better with larger models, especially if a MoE split is used.

gaogao 523 days ago

The intuition is that smaller models are figuring out things like grammar, where the whole model comes into play, but larger models, especially in the back half of training, have localized knowledge updates that can merge more easily in the AllReduce.
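
For anyone unfamiliar with LocalSGD, here is a rough toy sketch of the loop, not the setup from the article: each worker takes several SGD steps on its own data shard, then the workers average their weights, which is the step an AllReduce would perform on a real cluster. Everything here is an assumption for illustration: the one-parameter model `y = w * x`, the function names `local_sgd` and `make_shards`, and the hyperparameters. The AllReduce is simulated with a plain in-process mean.

```python
import random


def make_shards(num_workers=4, per_worker=32, true_w=3.0, seed=1):
    """Build one synthetic, noise-free data shard per worker for y = true_w * x."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        shard = []
        for _ in range(per_worker):
            x = rng.uniform(-1.0, 1.0)
            shard.append((x, true_w * x))
        shards.append(shard)
    return shards


def local_sgd(shards, rounds=50, local_steps=8, lr=0.05, seed=0):
    """Fit y = w * x with LocalSGD.

    Each round, every worker starts from the shared weight, runs
    `local_steps` SGD steps on its own shard without communicating,
    and then all workers average their weights.
    """
    rng = random.Random(seed)
    w_global = 0.0
    for _ in range(rounds):
        local_weights = []
        for shard in shards:
            w = w_global  # every worker starts the round from the shared weight
            for _ in range(local_steps):
                x, y = rng.choice(shard)
                grad = 2.0 * (w * x - y) * x  # d/dw of the squared error (w*x - y)^2
                w -= lr * grad
            local_weights.append(w)
        # Simulated AllReduce (mean): merge the workers' independent updates.
        w_global = sum(local_weights) / len(local_weights)
    return w_global


if __name__ == "__main__":
    w = local_sgd(make_shards())
    print(f"recovered w = {w:.3f}")
```

Raising `local_steps` cuts the number of synchronization rounds needed for the same number of gradient steps, which is where the communication savings come from. Whether the averaged updates still merge well at large scale is the empirical question discussed above; this toy problem cannot answer it.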

kurthr 523 days ago

Wow, yeah, a 10B-parameter model is pretty tiny, and 300 3-GPU clusters for $18M is not really cheap.

I guess enormous is in the eye of the beholder.
