| HN Mirror



	by DoctorOetker 275 days ago
	Isn't it much simpler to parallelize by having different "readers" (all using the same model parameters/weights) process different parts of the corpus in parallel? Reader A reads book A while reader B reads book B, and so on. Is there a deeper reason why more complicated parallelization, as in the OP or the article it references, is more desirable?

2 comments

jsharf 274 days ago

If you have independent copies of the network computing gradients, then you’re effectively making the batch size smaller, unless you’re doing an all-reduce to keep them in sync, in which case there’s a lot of communication overhead.

When you take a batch and calculate gradients, you’re effectively calculating a direction the weights should move in, and then taking a step in that direction. You can take more steps at once by doing what you say, but they might not all be exactly in the right direction, so overall efficiency is hard to compare.

I am not an expert, but if I understand correctly I think this is the answer.


immibis 274 days ago

A larger batch just means averaging the gradients from more individual calculations.
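To make that concrete, here is a minimal NumPy sketch, assuming a toy linear model with a mean-squared-error loss (the model, the shapes, and the `mse_grad` helper are all made up for illustration). Two "readers" with the same weights each compute a gradient on half of the batch, and averaging the two gives the same gradient as one worker processing the whole batch:

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w (hypothetical helper)."""
    residual = X @ w - y
    return X.T @ residual / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 examples, 3 features (arbitrary sizes)
y = rng.normal(size=8)
w = rng.normal(size=3)        # shared weights, identical on every reader

# One worker sees the whole batch of 8 examples.
full_grad = mse_grad(w, X, y)

# Two readers with the same weights each see half of the batch.
grad_a = mse_grad(w, X[:4], y[:4])
grad_b = mse_grad(w, X[4:], y[4:])

# Averaging the shard gradients (the job of the sync step) recovers
# the full-batch gradient, because the shards are the same size.
avg_grad = (grad_a + grad_b) / 2

grads_match = np.allclose(full_grad, avg_grad)
```

The equality only holds because both readers start from the same weights and the shards are equal in size; once the copies take independent steps without syncing, their weights drift apart and the averaged result is no longer a single large-batch step.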


zozbot234 274 days ago

AIUI, the thinking when developing transformers might have been that "reading text A vs. text B" just isn't parallel enough for truly large-scale training. The problem was to somehow also parallelize the learning of very long range dependencies within a single sequence, and transformers managed to do that.
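A rough NumPy sketch of that contrast, assuming a plain tanh recurrence on one side and single-head causal self-attention on the other (all weights, sizes, and variable names here are invented for illustration, not taken from any particular model). The recurrent loop needs step t-1 before it can do step t, while the attention outputs for every position come out of a few matrix products:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # sequence length and hidden size (arbitrary)
x = rng.normal(size=(T, d))      # one sequence of T token vectors

# Recurrent style: step t depends on the hidden state from step t-1,
# so the T steps have to run one after another.
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(T):
    h = np.tanh(h @ W_h + x[t] @ W_x)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention style: each output depends only on the inputs,
# so all T positions are computed together with no loop over time.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)

# Causal mask: position t may only attend to positions <= t.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# Row-wise softmax, shifted by the row maximum for numerical stability.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
attn_out = weights @ v
```

Both versions let a late position depend on an early one, but only the second exposes the whole sequence to the hardware at once during training.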
