by slashcom
2682 days ago
There’s a natural way to parallelize these models so that using 128 GPUs is equivalent to using a 128x batch size. You can similarly simulate a 128x batch size on a single GPU by accumulating gradients over 128 mini-batches before applying the optimizer update. So you can test on just one or a few GPUs before you run the full thing. By that point you know it’s going to work, it’s just a matter of how well and whether you could’ve done nominally better with different tuning. There’s been enough research leading up to this paper to suspect that just scaling larger would pay off.
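To make the batch-size equivalence concrete, here is a minimal sketch in plain Python. It is not from the paper: the 1-D linear model, the data, and the function names `grad_mse` and `accumulated_grad` are all made up for illustration. It assumes the loss is a mean over examples and the micro-batches are equal-sized; models with batch-dependent layers (e.g. batch norm) would not match exactly.

```python
import random

def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for the toy model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches before any update."""
    assert len(xs) % micro_batch_size == 0
    num_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient by 1/num_micro so the running sum
        # ends up as the mean over the whole (simulated) large batch.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total

# Hypothetical toy data: 128 examples standing in for a "128x" batch.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(128)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5

full = grad_mse(w, xs, ys)                                # one big batch of 128
accum = accumulated_grad(w, xs, ys, micro_batch_size=1)   # 128 micro-batches of 1
print(abs(full - accum) < 1e-9)
```

The two gradients agree up to floating-point rounding, which is why a run on one GPU with accumulation can stand in for the many-GPU run, just much more slowly.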
>By that point you know it’s going to work, it’s just a matter of how well and whether you could’ve done nominally better with different tuning.
This can't be true in all cases, right? I'd assume that many results that look promising at small compute turn out to be unimpressive once they're scaled up. I'm very curious what the trials-to-success rate is for publishable results once big compute is thrown into the mix.