

	by trsohmers 865 days ago
	"The current round" of AI accelerators you are referring to are things that were designed 2015-2022; There are a number of startups (including my own) that are actually designing for the real bottlenecks that differentiate Transformers (plus SSMs and other emerging architectures) from "old" CNNs, RNNs, etc. Obviously I think my company is doing this in an unique and "correct" way, but I know of half a dozen other companies founded in the past ~18 months that are focused on the memory capacity and bandwidth bottlenecks that exist... the massive failures of the previous decade do not mean that they are going to be repeated.

3 comments

EvgeniyZh 854 days ago

What can you actually do hardware-wise about the memory bottleneck, other than use faster memory?


pk-protect-ai 865 days ago

Is there any startup which is ready to compete with this: https://www.redsharknews.com/nvidia-wants-to-increase-comput... ?


simne 864 days ago

Electronics designers know that specialized circuits outperform GPUs by a factor of a few.

Before Tensor cores appeared, GPUs were about 4 times worse (in speed and power consumption).

With Tensor cores, GPUs got better, but they still have to carry video hardware (RAMDAC, video connectors, 3D processing units, and the on-chip network connecting all of it), so they still lag behind.

Really, GPUs are interesting only because current AI applications do not yet bring in enough revenue to pay for large-scale production of specialized chips.

I don't know whether Altman has something big that would bring in the revenue to pay for specialized chips.

There is speculation that GPT-5 will be good enough to replace humans at work. If that is true, AI chips will be worth it.


pk-protect-ai 863 days ago

We are indeed talking about a 10^6 factor here ... It's not just 10x or 100x, or even 1000x ... If NVIDIA strips away everything not required from their chips and adds more SDRAM and HBM, it won't improve performance by 100x; maybe they'll get 10x-15x out of this. But they claim they are going to achieve a 10^6x improvement in performance. Even if they end up delivering an ARM-compatible CPU with a built-in Tensor core, built-in HBM, and vast SDRAM, without DDR RAM at all, how fast can it be? This promise of a 10^6x performance improvement is a paradigm shift. They know something that we don't. Or they are just bluffing.
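
To put rough numbers on that gap, here is a minimal sketch. It takes the 10x-15x guess and the claimed 10^6x at face value and computes how large a factor would still have to come from somewhere else. The function name is made up for illustration.

```python
def remaining_factor(claimed: float, explained: float) -> float:
    """Multiplicative gap left once `explained` of a `claimed` speedup is accounted for."""
    if claimed <= 0 or explained <= 0:
        raise ValueError("speedup factors must be positive")
    return claimed / explained


CLAIMED = 10**6  # the 10^6x figure discussed in this thread

# The 10x-15x guess for a stripped-down chip with more memory.
for hw_gain in (10, 15):
    gap = remaining_factor(CLAIMED, hw_gain)
    print(f"{hw_gain}x from hardware leaves about {gap:,.0f}x unexplained")
```

Under these numbers, somewhere between roughly 67,000x and 100,000x would have to come from something other than trimming the chip and adding memory.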


simne 863 days ago

As for the technical questions you asked: you are asking the right questions, but you are missing context.

The real main bottleneck of NN hardware is neither number crunching nor memory.

The real bottleneck is that GPT-2 may be the last LLM that could be trained on one machine (even on one card).

For GPT-3, people usually talked about 32-GPU installations (which can fit in one machine); for GPT-4 scale, they talk about clouds.

And modern clouds are NUMA beasts. I could say that modern cloud networking is slow, but those are not the right words: it is slow as hell.

What all this means is that NNs are a good target for parallel processing in clouds, but not good enough. Real benchmarks say that the 32-card machine mentioned above is about 10 times faster than 1 card with the same amount of memory, and when things were scaled up for GPT-4, the benchmarks got much worse. So just improving the network, to move the bottleneck somewhere else, would give an additional 50-100x improvement.
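
The 32-card figure can be read as a parallel-efficiency number. Below is a rough sketch that assumes the "about 10 times faster on 32 cards" benchmark quoted above. The helper name is hypothetical, and efficiency here simply means measured speedup divided by device count.

```python
def parallel_efficiency(speedup: float, n_devices: int) -> float:
    """Fraction of ideal linear speedup actually achieved (1.0 = perfect scaling)."""
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    return speedup / n_devices


# Figures taken from the comment above: 32 cards, about 10x faster than 1 card.
observed = parallel_efficiency(speedup=10.0, n_devices=32)
print(f"efficiency per card on 32 cards: {observed:.2f}")
```

That comes out to about 0.31 per card, so under these numbers roughly 69% of the ideal throughput is lost to coordination and communication.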

And with a good team of AI scientists, it is more realistic to build a specialized hardware network for NN processing, or to tune the algorithms, than it is with a team specialized in GPU video processing.


pk-protect-ai 862 days ago

> GPT-2 may be the last LLM

This is not true. There are tons of models that are even better than GPT-3.5 and really close in performance to GPT-4, and you still can train them on a single GPU with 24GB video memory. Work published last year hints at better models still, which you can train on a single GPU to get a model comparable in performance to LLaMA2 34B. The horizontal scaling you appeal to here may fit into a 10^6 performance increase, but in general I expect a single node to be at least 1000 times faster than now. And it is totally plausible that you can't scale at 0.99 vertically, and of course not horizontally, but I honestly expect the scaling per GPU to get better than 0.75 in the next 5 years.


simne 862 days ago

> you still can train them on a single GPU with 24GB video memory

It depends on the goal. For pure science (or for fun), I could train a GPT-4 class model on a C64, but that approach does not fit a competitive market, where you need to test hypotheses quickly and deliver tuned models quickly.

- A competitive market is very sensitive to speed. For example, if MS presents something on December 10, then after New Year Google has to present something not merely equal but significantly better, just to appear equal to customers.

So horizontal scaling is a must, not just my wish, even when the speedup is far from linear.

> I honestly expect the scaling per GPU to get better than 0.75 in the next 5 years

Could you give an explanation, or even speculation, of how this is possible when we have already hit the limits of silicon (about 5GHz per core, 1nm, etc.)?


simne 862 days ago

BTW, I was not joking when I talked about training an LLM on a C64. I have often seen scientists run their tasks on a desktop and wait days or even weeks for results. But they usually have reasons for doing so: for example, to keep secret from colleagues what they are working on and what the calculations show, or to run something so unconventional that the bosses would not be happy to see it on the dedicated number-crunching machine.


simne 863 days ago

There is one important thing that many people aren't aware of. When a good, smart team (business or not doesn't matter much) focuses on one task and has the corresponding resources, it really can do things that are impossible for a general-purpose team aiming at a broad outcome.

As I see it, NVIDIA is a good, strong team; they bet at very high stakes when they made great acquisitions in the 2000s, and they won. But NVIDIA makes a broadly targeted product; they cannot focus narrowly on just neural nets. So it is possible to make an NN product better than NVIDIA's.

The real question is whether Altman's team can achieve economics good enough to pay for the hardware development.


simne 863 days ago

> But they claim they are going to achieve a 10^6x

It is classic management: ask people for more than they can do, and they will do as much as possible, so I don't worry much about such claims.

And it is also team-building BS: motivating people by announcing impossible targets.

We will see how Jensen Huang uses all his diplomatic skills and rhetorical art to smooth things over when it becomes clear that the claimed things are impossible.

And this is not the first time such things have happened; there is a near-infinite number of examples. Just a few days ago I read about the failure of the IBM 7030, which delivered ~1/10 of what was claimed, and yesterday people reminded me of Itanium and the i960.


cma 864 days ago

Will your arch work for SSMs?


trsohmers 864 days ago

Yes; Mamba was a very easy match, with Hyena also being a good match, but could be greatly optimized with some minimal changes to the model architecture or hardware design.
