by DoctorOetker
275 days ago
Is it not much simpler to parallelize by having different "readers" (using the same model parameters/weights) process different parts of the corpus in parallel? Reader A reads book A while reader B reads book B, and so on. Is there a deeper reason why more complicated parallelization, as in the OP or the article it references, is more desirable?
When you take a batch and calculate gradients, you’re effectively calculating a direction the weights should move in, and then taking a step in that direction. You can take more steps at once by doing what you describe, but each reader computes its step from the same starting weights without seeing the others' updates, so the steps might not all point in exactly the right direction. That makes overall efficiency hard to compare.
I am not an expert, but if I understand correctly I think this is the answer.
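
To make that concrete, here is a minimal sketch, assuming a toy one-parameter model (fit y = w * x, true w = 2) and plain gradient descent. The data, learning rate, and function names are all hypothetical and chosen only for illustration; real training setups usually average gradients across readers rather than summing independent steps.

```python
# Toy problem (hypothetical): fit y = w * x with squared error; the true w is 2.

def grad(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def parallel_update(w, shards, lr):
    """Each reader computes its step from the SAME starting weights;
    the independent steps are then applied together."""
    steps = [-lr * grad(w, shard) for shard in shards]
    return w + sum(steps)

def sequential_update(w, shards, lr):
    """Each step is computed from the weights left by the previous step."""
    for shard in shards:
        w -= lr * grad(w, shard)
    return w

# Two "books", one per reader.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

w_parallel = parallel_update(0.0, shards, lr=0.05)
w_sequential = sequential_update(0.0, shards, lr=0.05)
print("parallel:  ", w_parallel)
print("sequential:", w_sequential)
```

In this toy run the combined parallel steps overshoot the target further than the sequential steps do, because the second reader's step was computed without knowing the first had already moved the weight. With a smaller learning rate the gap shrinks, which is part of why the comparison is not clear-cut.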