	by learndeeply 1398 days ago
	Both. The first is for sharding tensors across GPUs; the second does an all-reduce (e.g. to synchronize gradients in distributed data parallel training).
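
	To make the two operations concrete, here is a minimal sketch in plain Python that simulates them on lists, with no GPU or communication library involved. The function names `shard` and `all_reduce_mean` are hypothetical, chosen for illustration, and are not the APIs the comment refers to. Averaging is assumed as the reduction, since that is a common choice for gradient synchronization; a sum would work the same way.

```python
from typing import List


def shard(tensor: List[float], num_devices: int) -> List[List[float]]:
    """Split a flat tensor into contiguous, near-equal chunks, one per device."""
    base, extra = divmod(len(tensor), num_devices)
    shards, start = [], 0
    for rank in range(num_devices):
        # The first `extra` devices take one additional element each.
        size = base + (1 if rank < extra else 0)
        shards.append(tensor[start:start + size])
        start += size
    return shards


def all_reduce_mean(per_device_grads: List[List[float]]) -> List[List[float]]:
    """Give every device the element-wise mean of all devices' gradients."""
    n = len(per_device_grads)
    mean = [sum(vals) / n for vals in zip(*per_device_grads)]
    # Each device receives its own copy of the reduced result.
    return [list(mean) for _ in range(n)]


# Sharding: each simulated device holds a different slice of one tensor.
shards = shard([1.0, 2.0, 3.0, 4.0, 5.0], num_devices=2)

# All-reduce: each simulated device starts with its own gradient for the
# same parameters and ends with the identical averaged gradient.
synced = all_reduce_mean([[1.0, 2.0], [3.0, 6.0]])
```

	The difference is in what each device ends up holding: after sharding, every device has a different piece of the tensor, while after an all-reduce every device has the same full result.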