| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by alexeldeib 251 days ago
	Larger memory, weaker comms. You can optimize for this by doing things like increasing batch size/data parallelism vs sharding schemes with more comms. At scale training won’t be able to avoid comms entirely, while many models can fit in a single MI300 for serving.