| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by itkovian_ 5 days ago
	This is called linear mode connectivity and seems to work for almost every large model. So well that in most cases it’s an explicit part of the training process; do many training ‘branches’ then merge then continue. It is not understood why it works so well.

1 comments

teravor 5 days ago

is that actually how they train them in the datacenter? the trillion sized weight vector gets cloned and sent off to groups of GPUs and averaged after?

link