
by simne 901 days ago

About the tech questions you asked: you asked the right questions, but you are missing context.

The real main bottleneck of NN hardware is neither number crunching nor memory.

The real bottleneck is that GPT-2 may be the last LLM that was possible to train on one machine (even on one card).

For GPT-3, people usually talk about 32-GPU installations (which can fit into one machine); for GPT-4 scale, they talk about clouds.

And modern clouds are NUMA beasts. I could say that modern cloud networking is slow, but those are not the right words: it is slow as hell.

What all this means: NNs are a good target for parallel processing in clouds, but not good enough. Real benchmarks say the 32-card machine mentioned above is about 10 times faster than one card with the same amount of memory, and when things are scaled to GPT-4 size, the benchmarks become much worse. So just improving the network, to move the bottleneck to something else, would give an additional 50-100x improvement.

And with a good team of AI scientists, it is more realistic to build a special hardware network for NN processing, or to tune the algorithms, than with a team specialized in GPU video processing.
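To put a number on the benchmark figure above, here is a minimal sketch that turns "32 cards, about 10 times faster" into a per-card scaling efficiency. The function name `scaling_efficiency` is made up for illustration, and the inputs are only the rough figures quoted in this comment, not measured data.

```python
def scaling_efficiency(speedup: float, n_devices: int) -> float:
    """Fraction of ideal linear speedup actually achieved per device."""
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    return speedup / n_devices


# Rough figures from the comment: 32 cards, about 10x faster than one card.
eff = scaling_efficiency(10.0, 32)
print(f"efficiency per card: {eff:.4f}")  # prints 0.3125
```

So by these numbers, each card in the 32-card machine delivers roughly a third of what it would under perfect linear scaling.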


pk-protect-ai 901 days ago

> GPT-2 may be the last LLM

This is not true. There are tons of models that are even better than GPT-3.5 and really close in performance to GPT-4, and you can still train them on a single GPU with 24GB of video memory. There are hints of even better models, published last year, which you can train on a single GPU to get a model comparable in performance to LLaMA2 34B. The horizontal scaling you appeal to here may fit into a 10^6 performance increase, but in general I expect a single node to be at least 1000 times faster than now. And it is entirely plausible that you can't scale at 0.99 vertically, and of course not horizontally, but I honestly expect the scaling per GPU to get better than 0.75 in the next 5 years.


simne 901 days ago

> you still can train them on a single GPU with 24GB video memory

It depends on the goal. For pure science (or for fun), I could train a GPT-4-class model on a C64, but this method will not work in a competitive market, where you need to check hypotheses quickly and deliver tuned models quickly.

- A competitive market is very sensitive to speed. For example, if MS presents something on December 10, then after New Year Google has to present something not merely equal but significantly better, just to appear equal to customers.

So horizontal scaling is a must, not just my wish, even when the speed increase is far from linear.
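One common way to picture a "far from linear" speed increase is Amdahl's law. This is an assumption brought in for illustration, not something the thread itself uses, and the 0.95 parallel fraction below is a hypothetical value, not a measurement of any real training run.

```python
def amdahl_speedup(parallel_fraction: float, n_devices: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallelizable fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_devices)


# Hypothetical workload: 95% of the work parallelizes, 5% does not
# (for example, time spent waiting on the network).
for n in (1, 8, 32, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Under this model, adding devices always helps a little, but the speedup is capped at 1 / (serial fraction), which is why shrinking the non-parallel part (such as network waits) matters more than adding cards.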

> I honestly expect the scaling per GPU to get better than 0.75 in the next 5 years

Could you give an explanation, or even speculation, of how this is possible when we have already hit silicon limits (about 5GHz per core, 1nm, etc.)?


pk-protect-ai 901 days ago

> Could you give an explanation, or even speculation, of how this is possible

Nope. But I'm so desperate to give you a hint right now that it is almost impossible to hold myself back... Stop looking into horizontal scalability. The vertical one is not exhausted yet. Btw, that was not the hint.


simne 900 days ago

> Stop looking into horizontal scalability.

Sure. A B-747 officially needs about 700 man-years to assemble; let's build them with small but highly motivated teams, following the classic 3-pizza rule. The world will wait :)


simne 901 days ago

BTW, I was not joking when I talked about training an LLM on a C64. Many times I have seen scientists who run their tasks on a desktop, waiting days or even weeks for results. But they usually have reasons for such behavior: for example, to keep secret from colleagues what they are working on and what the calculations show, or to run something so unconventional that the bosses would not be happy to see it on a dedicated number-crunching machine.
