|
> you still can train them on a single GPU with 24GB video memory

It depends on the goal. For pure science (or for fun), I could train a GPT-4-class model on a C64, but that approach doesn't work in a competitive market, where you need to test hypotheses quickly and ship tuned models quickly.

The competitive market is very sensitive to speed. For example, if MS presents something on December 10, then after New Year Google has to present something not merely equal but significantly better, just to appear equal to customers. So horizontal scaling is a must, not just my wish, even when the speedup is far from linear.

> I honestly expect the scaling per GPU get better than 0.75 in next 5 years

Could you explain, or even speculate, how this is possible when we have already hit silicon limits (about 5GHz per core, 1nm, etc.)? |
Nope. But I'm so desperate to give you a hint right now that it's almost impossible to hold myself back... Stop looking into horizontal scalability. Vertical scalability is not exhausted yet. Btw, that was not the hint.