| HN Mirror



	by fourthark 308 days ago
	Seems like training would be a better match, where you need tons of compute but don’t care about latency.

1 comments

ronsor 308 days ago

No, the problem is that with training, you do care about latency, and you need a crap-ton of bandwidth too! Think of the all_gather; think of the gradients! Inference is actually easier to distribute.
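
To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. It assumes a ring all-reduce over the gradients, which is one common choice and not something stated above. The function name and every input value are illustrative assumptions, not measurements.

```python
def ring_allreduce_seconds(num_params, bytes_per_param, num_nodes,
                           bandwidth_bytes_per_s, latency_s):
    """Rough per-step cost of a ring all-reduce over the gradients.

    Hypothetical helper: assumes every link has the same bandwidth and
    latency, and ignores overlap with compute.
    """
    payload = num_params * bytes_per_param
    # A ring all-reduce takes 2 * (N - 1) steps; each step moves
    # payload / N bytes per node.
    steps = 2 * (num_nodes - 1)
    transfer = payload * steps / num_nodes / bandwidth_bytes_per_s
    return transfer + steps * latency_s


# Assumed scenario: 1B parameters in 16-bit, 8 nodes,
# 1 Gbit/s links (125e6 bytes/s), 50 ms per hop.
estimate = ring_allreduce_seconds(1_000_000_000, 2, 8, 125e6, 0.05)
print(f"{estimate:.1f} s per gradient sync")
```

Under those assumptions a single gradient sync takes tens of seconds, and that cost is paid on every optimizer step.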


meehai 308 days ago

Yeah, but if you can build topologies based on latency, you may get some decent tradeoffs. For example, with N=1M nodes each doing batch updates in a tree, i.e., the all-reduce is layered by the latency between nodes.
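
A toy sketch of what that layering could look like, simulated with plain Python lists rather than a real collective library. The function name, the two-level grouping, and the node names are all assumptions for illustration; a real setup would derive the clusters from measured latencies and could use more than two levels.

```python
from typing import Dict, List


def hierarchical_allreduce(grads: Dict[str, List[float]],
                           clusters: List[List[str]]) -> Dict[str, List[float]]:
    """Two-level sum all-reduce: inside each cluster first, then across clusters.

    Hypothetical helper: `clusters` groups nodes that are assumed to be
    close to each other (low latency).
    """
    # Stage 1: sum gradients inside each low-latency cluster.
    cluster_sums = [
        [sum(vals) for vals in zip(*(grads[node] for node in members))]
        for members in clusters
    ]
    # Stage 2: sum the per-cluster results; only this stage crosses slow links.
    global_sum = [sum(vals) for vals in zip(*cluster_sums)]
    # Stage 3: every node receives a copy of the global result.
    return {node: list(global_sum) for members in clusters for node in members}


grads = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0], "d": [7.0, 8.0]}
clusters = [["a", "b"], ["c", "d"]]  # assumed: a/b are near each other, c/d too
result = hierarchical_allreduce(grads, clusters)
print(result["a"])
```

The point of the layering is that only one partial sum per cluster has to cross the high-latency links, instead of one per node.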
