by hashxyz 855 days ago
This is not going to age well. The current round of AI accelerators is going to flop hard because there is a deep hardware-software mismatch. All the accelerators target GEMM and CONV, and get bottlenecked when most of the other extremely common tensor operators get mixed in. It turns out that Nvidia GPUs are already pretty close to the ideal type of chip you need to execute models people actually want to use.

Nobody in the AI chip hypespace seems to understand this, it’s just stupid money running around trying to eat Nvidia’s margins. Sam Altman understands this less than plenty of people.

It’s becoming harder for me to see him as anything besides someone who is very talented at growing power, but not much else. Perhaps he will succeed in misallocating a trillion dollars along the way.

10 comments

Ok I guess the team with the largest LLM workload in the world and billions in funding won't understand how to optimise a chip for the exact workload they have and near future ones.
Exactly. Present success means the ability to forecast what’s needed for future success — see the Pierce-Arrow Motor Car Company and their dominance in the market to this very day
This person is not saying success -> more success. I think they’re just pointing out that Altman is smart and is surrounded by smart people and a company that understands the demand because they make up the majority of the demand (and they have a strong thesis).
Is he raising for OpenAI or for another venture? If he is using deep knowledge from OpenAI to raise money for another venture, this sounds wrong.
He is rich and powerful, of course it isn’t wrong

/s

Or broke and powerful? Because of spending a fortune on WorldCoin, working at a nonprofit and heavily investing into early AI startups?
No way OpenAI makes up even a plurality of chip demand
Not OpenAI itself, but Microsoft does.

For 2022 and 2023, Microsoft bought a significant portion of NVIDIA's available hardware. They spent quite a bit of 2023 trying to figure out how to even power the multiple fleets of GPUs. Only now, with adoption of Azure OpenAI going from mild to the expected wild, are they getting around to servicing all their (potential) customers.

[citation needed]

Seriously, this is an outlandish claim just from looking at Microsoft's and Nvidia's market caps.

I am sure that Microsoft is gonna be one of Nvidia's largest customers, but I sincerely doubt it's even a double-digit percentage of their revenue.

This ignores Google's in house chips and their internal usage. They've been at this much longer. I doubt we have the visibility to know how they compare in terms of available flops and the unit costs
Perhaps. I have no idea and am not purporting to know.
Can you elaborate on this? Per ChatGPT:

> Using Pierce-Arrow Motor Car Company as an example of such success is historically inaccurate. Pierce-Arrow was an American automobile manufacturer based in Buffalo, New York, which was known for producing luxury cars. It was indeed a dominant and prestigious brand in the early 20th century. However, the company did not manage to maintain its success and ultimately failed to adapt to changing market conditions. It faced financial difficulties during the Great Depression and eventually went bankrupt in 1938. Pierce-Arrow's inability to forecast and adapt to the economic changes and shifts in consumer preferences of the time led to its decline.

From the very answer ChatGPT gave you, it's evident that GP is saying that current success does not imply future success, using that company as an example. What needs elaboration?
It's pretty clear he is trying to make the opposite point, see "dominance in this market to this very day"
While vertical integration is a great boon for a company, it's hard to pull off. Being an expert in industry X doesn't mean you'll do great in industry Y, even if they are complementary.

Training and designing LLMs doesn't mean you understand the semiconductors business.

Vertical integration? It may not be an OpenAI project, going by the reporting when he was ousted. I won't be surprised if the plans are for a Muskian incestuous/I-swear-it's-not-self-dealing setup, with Altman being the CEO of both entities.
Correct. They're an LLM team, not chip designers.
Yeah, it's not even like they're running the datacenters where the training and tuning are happening. I would hope some of the people understand what current compute requirements are and perhaps they know better than most what future requirements will be. However, MS has been doing most of the backend for OpenAI and they've been in discussions with actual silicon architecture people (not just NVidia), but those are the folks who would do any implementation.

Perhaps they'll pull off an Apple (for ARM) and do their own architecture (either for training/tuning or inference) that will have a significant effect on the industry, but it seems unlikely. They haven't hired the right people.

The real advantage they might have is insight into how the algorithms can be adapted to reduce power consumption/latency while improving performance. It would seem odd to me, if there weren't more than an order of magnitude in new algorithms for LLMs. You're not going to get 10x the transistors or speed from silicon, but you might get an efficient architecture for a significant algorithmic improvement (that might not just be CUDA).

"I know how machine learning and statistical computing works, therefore I am an expert in hardware design" fallacy.
> "I know how machine learning and statistical computing works, therefore I am an expert in hardware design" fallacy.

A typical case of engineer's disease.

I am guessing an incredibly talented team that is incredibly networked and incredibly well funded and proven agile in the tech hub of the world can find hardware experts. Don’t know why anyone would bet against that.
We would have heard if they had hired/bought a team of the size necessary to design a system large enough to have a significant impact. Modern design (even sub-28nm, much less 2nm) is hugely complex, and the range of things that an AI compute engine needs to do is very broad.

Perhaps they could design a core and license it out? I'm trying to come up with a way they can do something significant without 100 people. Just the memory and serial connections are complex enough ignoring the GPU or heat/power issues.

It took Apple like 10 years to go from their first chips to actually using them in laptops, and they are literally the most well capitalized company on the planet. Sorry if I'm skeptical that some relative upstarts with a billion in compute from Microsoft can compete with trillion-dollar companies that have been around for decades.
Nobody can even define what AI is, why we need it, or how to achieve it. Usually it makes sense to seek funding to execute on a plan. Making a fancy chat bot that scrapes the web to synthesize sometimes accurate and sometimes useful information is not worth trillions of dollars.

What is essentially happening in my opinion is technical innovation has slowed so silicon valley is seeking money to prop up a house of cards that doesn't make much new that is useful or needed.

Can anyone specifically say what trillions of dollars invested in "AI" would buy for society?

It seems to me there are so many higher priorities.

I wouldn't bet against it but that approach has a remarkably low rate of success. We hear about the winners - survivorship bias is real.
How about something along the lines of AWS and their Graviton?
Graviton - you mean the poorly performing solution that only has a space in the market because amazon sells it as a subsidized cost as part of a larger effort to put pricing pressure on amd/intel? That Graviton?
Was Google a chip designer before the first TPU?
Yes. Google had a number of chip products before that. Some made it to A1 and worked. Just cause they don’t advertise it doesn’t make it not so.
> Yes. Google had a number of chip products before that.

Is that true? I can't find anything suggesting it is. In fact, the little I can find suggests you are incorrect. I'll link them for the sake of referencing sources but they're both pretty awful ad-ridden sites...

A 2016 Tech Radar interview [0] with Norm Jouppi has him quoted as saying:

> [The] Tensor Processing Unit (TPU) is our first custom accelerator ASIC [application-specific integrated circuit] for machine learning [ML], and it fits in the same footprint as a hard drive.

And a 2023 Tom's hardware post [1] begins:

> Google has made significant progress in its endeavor to develop its own data center chips, according to a new report. The Information says that a key milestone has just been reached, which means that Google can plan to roll out server systems powered by the new chips starting from 2025. This is not the first processor that Google has successfully put through R&D - the company has previously made an ASIC for servers and an SoC for mobile devices. The search giant started using its internally developed Tensor Processing Unit (TPU) as far back as 2015.

[0]: https://www.techradar.com/news/computing-components/processo...

[1]: https://www.tomshardware.com/news/google-reaches-self-develo...

I guess it depends on what you are defining as a chip and what you are defining as "Google" -- as in if they have contractors design/build to their needs does that count.

1/ https://www.wired.com/2012/03/google-microsoft-network-gear/

2/ I believe they had a few custom chips designed for the youtube workloads that predate the TPU.

I remember in 2010 there was a building in MV that focused on custom chips.

Said the horse factory when automobiles were being built.
I don't remember LLMs claiming to replace GPUs. This is more like arguing with a landowner about why your assembly line is so innovative and needs to be built on their land for free. They need the land; the land doesn't necessarily need them yet.
Pullman Company will disagree with you.
Absolutely terrible analogy.
An LLM might "believe" that horses are built in factories.
It makes sense to ASIC-ify the thing to get lower latencies and make the whole thing cheaper, so MS can run GPT-(n+1) cheaper. But this bet only pays off if the LLM industry gets into the mature stage where costs dominate, not innovation.
The workload they have is already optimized for something like an Nvidia GPU.
I apologise if my response was a little snarky.

Even granted that OpenAI are not able to build a chip that is competitive with NVidia's latest GPUs for running LLMs right away (which is an opinion - not backed by any direct evidence, but I agree that it is plausible as they are going up against a lot of prior R&D) is it not possible that:

a) The unit economics could be so much better that the result is still a major win, e.g. 50% of the performance at 20% of the price.

b) OpenAI is decoupled from existing supply constraints and is able to grow faster and deliver more value. A "worse" chip that you can actually get (in insane volume) may be strategically better than a "superior" chip that is limiting your growth.

c) That the plan might include some elements you are not expecting - at the $trillions investment level they might be looking at doing some surprising things e.g. (I am just making this up but there are a lot of possibilities) buy a memory manufacturer and work directly on increasing memory bandwidth.

From a lay observer point of view of the semiconductor industry of the last two decades, it seems entirely implausible they could do that quickly without just buying a company that was already working on it. And then, unless that company was big enough to already have a significant defensive patent portfolio, it's likely their efforts would be stymied in court for years if it was remotely successful.

The idea that even with expertise, the wins would be so much over what other companies that have hired/bought these companies have been designing for the last 10 years based on very similar requirements (the ones that wrote so much of the foundational research) also seems implausible.

c) It's not actually possible to plan investments at that level with anything more than a very vague direction you're aiming. If it is long term, then everything is changing in unpredictable ways before you get even 25% there, but if you throw so much money at the problem in order to try to solve it much more quickly you are disrupting global economic and geopolitical forces in ways that also can't be planned for.

"50% of the performance at 20% of the price" is wildly implausible even if you can somehow start fabbing perfect chips for openai's workloads tomorrow. Especially if they don't have access to the fabrication processes that nvidia, amd etc are using, since more modern (read: expensive) processes reduce power draw and enable higher clocks. 80% of nv's datacenter die space is not wasted, not close to that much.

It seems more likely to me they'd get 20% of the performance at 50% of the price, and that might still work out for them if it allows them to scale faster without being bottlenecked on supply of existing GPUs. But there's no magic bullet here.
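
To put the two scenarios from this subthread side by side, here is a minimal sketch that compares performance per dollar against a baseline GPU normalised to 1.0. The function name and the normalisation are illustrative assumptions; the percentages are the ones quoted in the comments above, not measured figures.

```python
def perf_per_dollar(relative_perf: float, relative_price: float) -> float:
    """Performance per dollar relative to a baseline chip (baseline = 1.0)."""
    if relative_price <= 0:
        raise ValueError("relative_price must be positive")
    return relative_perf / relative_price


# "50% of the performance at 20% of the price" (the optimistic case)
optimistic = perf_per_dollar(0.5, 0.2)
# "20% of the performance at 50% of the price" (the pessimistic case)
pessimistic = perf_per_dollar(0.2, 0.5)

print(f"optimistic:  {optimistic:.2f}x baseline perf per dollar")
print(f"pessimistic: {pessimistic:.2f}x baseline perf per dollar")
```

Under these numbers the pessimistic chip is worse per dollar than the baseline, so it would only make sense if supply, not price, is the binding constraint, which is the point made above.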

They also still need to source a bunch of other stuff, like RAM, even if they can source their own processors.

Nobody is able to build a chip that is competitive with NVidia's latest GPUs, not even AMD who would be next in line. Look at Google's TPU for a glimpse at a likely outcome of such an endeavor.

What it tells me is that Altman seems to believe that OpenAI can only make the next step if they can throw even more compute at the problem but that that isn't feasible at today's prices.

"The current round" of AI accelerators you are referring to are things that were designed 2015-2022; There are a number of startups (including my own) that are actually designing for the real bottlenecks that differentiate Transformers (plus SSMs and other emerging architectures) from "old" CNNs, RNNs, etc.

Obviously I think my company is doing this in a unique and "correct" way, but I know of half a dozen other companies founded in the past ~18 months that are focused on the memory capacity and bandwidth bottlenecks that exist... the massive failures of the previous decade do not mean that they are going to be repeated.

What can you actually do hardware-wise about a memory bottleneck, except use faster memory?
Is there any startup which is ready to compete with this: https://www.redsharknews.com/nvidia-wants-to-increase-comput... ?
It is well known among electronics designers that specialized circuits outperform GPUs by a factor of several.

Before Tensor Cores appeared, GPUs were about 4 times worse (in speed and power consumption).

With Tensor Cores, GPUs got better, but they still have to carry video hardware (RAMDAC, video connectors, 3D processing units, and the interconnect to tie it all together), so they still lag behind.

Really, GPUs are interesting only because current AI applications do not generate enough revenue to pay for large-scale production of specialized chips.

I don't know whether Altman has something big that would generate the revenue to pay for specialized chips.

There is speculation that GPT-5 will be good enough to replace humans at work. If that is true, AI chips will be worth it.

We are indeed talking about a 10^6 factor here ... It's not just 10x or 100x, or even 1000x ... If NVIDIA strips away everything not required from their chips and adds more SDRAM and HBM, it won't improve performance by 100x; maybe they'll make it 10x-15x with this. But they claim they are going to achieve a 10^6x improvement in performance. Even if they end up delivering an ARM-compatible CPU with built-in Tensor Cores, built-in HBM, and vast SDRAM, without DDR RAM at all, how fast can it be? This promise of a 10^6x performance improvement is a paradigm shift. They know something that we don't. Or they are just bluffing.
On the technical questions you asked: you asked the right questions, but you are missing context.

The real main bottleneck of NN hardware is neither number crunching nor memory.

The real bottleneck is that GPT-2 may be the last LLM that could be trained on one machine (even on one card).

For GPT-3, people usually talked about 32-GPU installations (which can fit in one machine); for GPT-4 scale, they talk about clouds.

And modern clouds are NUMA beasts. I could say modern cloud networking is slow, but those are not the right words: it is slow as hell.

What all this means is that NNs are a good target for parallel processing in clouds, but not good enough. Real benchmarks say the 32-card machine mentioned above is about 10 times faster than 1 card with the same amount of memory, and when things were scaled up to GPT-4, the benchmarks got much worse. So just improve the network to move the bottleneck somewhere else, and you will get an additional 50-100x improvement.
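
The 32-card figure quoted above can be turned into a parallel efficiency number with a few lines. This is only a sketch: the function names are made up for illustration, and the 10x-on-32-cards input is the commenter's claim, not a benchmark I can point to.

```python
def parallel_efficiency(speedup: float, n_devices: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    return speedup / n_devices


def scaling_headroom(speedup: float, n_devices: int) -> float:
    """How much faster the same devices would be with perfect linear scaling."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return n_devices / speedup


# Claim from the comment: 32 cards are about 10 times faster than 1 card.
eff = parallel_efficiency(10.0, 32)
gain = scaling_headroom(10.0, 32)
print(f"efficiency: {eff:.1%}, headroom from perfect scaling: {gain:.1f}x")
```

At 32 cards the headroom from a perfect interconnect is about 3.2x; the 50-100x figure would have to come from much larger clusters where efficiency is far lower, which the comment asserts but does not quantify.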

And with a good team of AI scientists, it is more realistic to build a special hardware network for NN processing, or to tune the algorithms, than with a team that specializes in GPU video processing.

> GPT-2 is may be last LLM

This is not true. There are tons of models that are even better than GPT-3.5 and really close in performance to GPT-4, and you can still train them on a single GPU with 24GB of video memory. There are hints of even better models, published last year, which you can train on a single GPU to get a model comparable in performance to LLaMA2 34B. The horizontal scaling you appeal to here may fit into a 10^6 performance increase, but in general I expect a single node to be at least 1000 times faster than now. And it is entirely plausible that you can't scale at 0.99 vertically, and of course not horizontally, but I honestly expect the scaling per GPU to get better than 0.75 in the next 5 years.

There is one important thing many people aren't aware of. When a good, smart team (business or not doesn't matter much) focuses on one task and has the corresponding resources, it really can do things that are impossible for a general-purpose team aiming at some broad outcome.

As I see it, NVIDIA is a good, strong team; they bet at very high stakes when they made great acquisitions in the 2000s, and they won. But NVIDIA makes a broadly targeted product; they cannot focus narrowly on just neural nets. So it is possible to make an NN product better than NVIDIA's.

The real question is whether Altman's team can achieve good enough economics to pay for the hardware development.

> But they claim they are going to achieve a 10^6x

A management classic: ask people for more than they can do, and they will do as much as possible, so I don't worry much about such claims.

This is also team-building BS: motivating people by setting impossible targets.

We will see how Jensen Huang uses all his diplomatic skill and rhetorical art to smooth things over when it becomes clear that the claimed things are impossible.

And this is not the first time such things have happened; there are a near-infinite number of examples. Just a few days ago I read about the IBM 7030 failure, which delivered ~1/10 of what was claimed, and yesterday people reminded me of Itanium and the i960.

Will your arch work for SSMs?
Yes; Mamba was a very easy match, with Hyena also being a good match, but could be greatly optimized with some minimal changes to the model architecture or hardware design.
NVIDIA's margin is someone's money. I wouldn't say they don't understand it. I would say they need good enough competition to get the margin down.

E.g. FB saying they want to buy 350k H100. That's just a whopping $14B price tag. With a >85% profit margin. While a fab is $20B.
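
The arithmetic behind those figures, as a quick sketch. The inputs are the numbers quoted in this comment; treating the ">85% profit margin" as a gross margin on the unit price is my assumption, as are the function names.

```python
def implied_unit_price(total_spend: float, units: int) -> float:
    """Average price per unit implied by a total order value."""
    return total_spend / units


def implied_unit_cost(unit_price: float, margin: float) -> float:
    """Cost per unit if `margin` is the seller's margin on the unit price (assumption)."""
    return unit_price * (1.0 - margin)


TOTAL_SPEND = 14e9    # "$14B price tag"
UNITS = 350_000       # "350k H100"
MARGIN = 0.85         # ">85% profit margin", used here as a lower bound
FAB_COST = 20e9       # "a fab is $20B"

price = implied_unit_price(TOTAL_SPEND, UNITS)
cost = implied_unit_cost(price, MARGIN)
print(f"implied price per unit: ${price:,.0f}")
print(f"implied cost per unit at an 85% margin: about ${cost:,.0f}")
print(f"order value as a fraction of one fab: {TOTAL_SPEND / FAB_COST:.0%}")
```

So the single order works out to roughly $40k per card and about 70% of the quoted price of a fab, which is the comparison the comment is drawing.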

Trillion? Sounds like anchoring to me. Nvidia has a market cap of $1.7T. You could literally buy NVIDIA for that. I read that as "a billion won't cut it, we need quite a few billions".

But it's not unreasonable that those hyperscalers throw in a few billion each.

Usually it's horrible business not to be best (see Intel/AMD). Because the margins are at the top. In this case though they want a whole range of products to go down in margin. Even a slightly worse chip might be worth it if it comes at a significant cost reduction. Especially if the optimal design is known!

In a sense the whole thing can fail at reaching the top or making lots of money and still succeed in bringing total cost down, potentially by 50% or more.

There are probably a lot of optimizations to find in the silicon and software. It's not necessarily obvious which corners can get cut, or where the tradeoffs of taping out new chips are worth it. Yet Another Matrix Multiplication Chip is not going to set the world on fire. Nvidia has that market pretty well captured.

But perhaps it turns out that subnets can be trained independently or swapped with semantically equivalent but qualitatively different ones. The routing network would effectively "standardize" and could in principle be well enough understood to "hand optimize" the routing network into hardware. Or maybe back propagation has some novel physical analogue that can be exploited at scales we can access. The real question is whether Altman is capable of finding the right path in the notoriously dead-end-filled field of chip design. His backing of Helion [1] didn't bode well in my view. But with enough R&D maybe he will flail into something useful; trillions is enough for a lot of flailing.

[1] https://youtu.be/3vUPhsFoniw Edit: more derisive link

Could it be that today's workloads are perfect for Nvidia GPUs? Not because it is an ideal chip, but rather because, given their availability, current workloads are made to take advantage of Nvidia GPUs' architecture.
Most of the workloads have not yet caught up with Nvidia Hopper optimizations. The key is the Tensor Cores.

Google came up with the TPU (2015) for GEMM. Nvidia just took the idea and ran with it (Tensor Cores first shipped with Volta in 2017). So it wasn't that Nvidia had a head start on this.

Now Nvidia Hopper is ahead of everybody else by far. They have things like async memory management for the tensor cores (Tensor Memory Accelerator), mixed precision, and even FP8 support.

Most of the software out there has not yet caught up with that. And even Nvidia's own Transformer Engine software is not making the best use of it (Microsoft Research, October 2023: backward pass and cross-device communication).

Last year FlashAttention was a game changer for performance by doing memory load optimizations. Nobody was optimizing properly for Nvidia in Transformer models.
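
For readers wondering what "memory load optimizations" can look like here: one idea commonly associated with FlashAttention is computing softmax-weighted sums block by block, so the full row of attention scores never has to be held at once. The sketch below shows only that streaming-softmax trick for a single query with scalar values, in plain Python. It is not FlashAttention's actual kernel (which is CUDA and far more involved), and all names are illustrative.

```python
import math


def softmax_reference(scores):
    """Ordinary softmax that needs all scores at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def streaming_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_i softmax(scores)[i] * values[i] one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # sum of exp(score - running_max) * value seen so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum (factor is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = sum(w * v for w, v in zip(softmax_reference(scores), values))
streamed = streaming_softmax_weighted_sum(scores, values, block_size=2)
print(abs(reference - streamed) < 1e-9)
```

The block-wise version keeps only three running numbers per query, which is what allows a real kernel to work tile by tile out of fast on-chip memory instead of round-tripping the score matrix through HBM.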

Systolic arrays for matrix multiplication go back farther than TPU.
The scale of this should tell us it's not just about building an alternative to Nvidia.

$7 trillion is like adding TSMC, Intel and AMD together, and multiplying that combination by seven.

This is about sheer capacity, not just circumventing CUDA.

Why not just give like a fraction of that to NVidia and tell them "make us more please, we will buy in bulk"?
What they are highly optimized for is mixed-precision GEMM (like all other accelerator manufacturers). What distinguishes Nvidia for now (imo) is that CUDA cores are also quite good at normal code (with control flow etc). I used to think that being close to optimal in one of them would contradict being close to optimal in the other but it turns out they share a lot of resources (SRAM) and the overhead in chip surface if one or the other is laying dormant seems negligible. I'm pretty sure that AMD et al will be sufficiently successful at blatantly copying the CUDA API that we will see serious competition in the next years. The bigger source of uncertainty might actually be fabbing capacity.

I find it hard to argue that this moat supports a 1.7T valuation. I find it hard to believe that for a couple of billion + TSMC credits no one would be able to recreate the CUDA ecosystem + hardware in the medium term.

Doesn't nvidia have huge margins? so if someone just makes a clone of the nvidia gpu then it can erode their margins and drive down the cost of compute
AMD will succeed at this as long as they keep it together.
Every time I'm tempted to think software is easy compared to hardware, I just remember that AMD is leaving about a trillion dollars worth of market cap on the table, because they haven't figured out a good alternative to CUDA.
They are definitely putting a lot of effort into ROCm & HIP, and definitely accelerating.

ROCm 6 was out Dec 16 (2023), 5.5 was May (2023). 5 was Feb 10 (2022). 4 was Dec 19 (2020)

Fred Brooks wrote in The Mythical Man-Month that it's harder (more time-consuming) to produce the software that corresponds to a given hardware. In 1975.
Hardware was much simpler and less complex then than now. I wonder how or if that's changed by going from hundreds or thousands of transistors to billions.
They’ll need to either reverse engineer CUDA or incentivize reimplementation of everything out there to use ROCm/OpenCL and forgo all the work load optimization done for Nvidia GPUs. I think that’s a non trivial moat.
This has been my perception of AMD for the past 20 years. First against Intel, then ARM, now NVIDIA. "If only ..."
The real bitch is you also need to replicate both the software and convince some large projects (eg, pytorch) to use and support your implementation, and it’s just all rough, very complicated, very fine-grained stuff. The hurdles here are very high.

And if you fuck that part up in any one of a dozen places, no one will use it, because the adoption cost is too high, or your implementation was 20% slower and so everything costs 20% more to use and no one uses it.

This is why you see things like TPUs never really damage NVIDIA, but why basically everyone is focused on open standards and open software. Basically the entire tech industry is using this approach as a way to slowly peel away the layers of this software until enough has been removed that NVIDIA can no longer use it as a moat.

While I doubt OpenAI will be a good fit for semiconductors, my understanding is PyTorch and TensorFlow have been really good at embracing new accelerators, largely due to XLA.

PyTorch, TF, and JAX work great on TPUs. Adoption is low bc they are not really available outside the Google cloud.

AWS uses tricks to accelerate PyTorch with Inferentia/Trainium. Haven’t used it, but I have tried the equivalent for Apple silicon and rage quit after wasting half a day.
I mean, it took almost a decade to get there.
Right, but that was for XLA, no? I think (not an expert) that it compiles code from frameworks into a lower-level IR.

That's gotta be way easier, no?

If you are going to go vertical then do it properly.

OpenAI could just build their own framework for internal use that works well on their silicon (see Jax+tpu)

Their starting point? Triton plus some triton libs. Jax chipped away at TF like this, and no reason why Triton can’t do the same to PyTorch.

Competitors don't have access to the process node. You'll get competitors, but they won't be as fast or able to run the latest models. That means they'll compete with older versions of NVIDIA's chips.
Agreed. Commoditizing the complement of OpenAI's models.
As far as I know Sam has no technical expertise besides taking money from other non experts who happen to be rich. It is unclear to me why existing GPU manufacturers are not up to the challenge of meeting the needs of "AI" software as you said.
Accelerators have nothing to do with it as we're mostly memory bound by HBM <> SRAM data transfer rather than compute bound.
It depends. Right now once we hit 6-8 bit precision inference, H100s/A100s are not memory-bound, but compute-bound.
This is wrong, being memory bound or not has to do with the dimensions of the matrices being multiplied (if you’re on tensor cores). https://docs.nvidia.com/deeplearning/performance/dl-performa...

Some of the things being done to improve quality of 6-8 bit inference use extra compute and push it a little in the other direction but it’s still pretty memory intense until the batch size gets quite large
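
The "depends on the matrix dimensions" argument can be sketched with a simple arithmetic-intensity calculation: count the FLOPs of a matrix multiply, divide by the bytes moved, and compare against the chip's ratio of peak FLOP/s to memory bandwidth. The hardware numbers and the 8192-wide layer below are placeholder assumptions for illustration, not specs of any real GPU or model, and the model ignores caching and tiling effects.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: float) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], counting each matrix once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(m, n, k, bytes_per_element, peak_flops, mem_bandwidth):
    """True when the GEMM's intensity is below the chip's FLOPs-per-byte ratio."""
    intensity = gemm_arithmetic_intensity(m, n, k, bytes_per_element)
    return intensity < peak_flops / mem_bandwidth


# Placeholder hardware numbers (assumptions, not a real product's specs).
PEAK_FLOPS = 1.0e15      # 1,000 TFLOP/s of matrix math
MEM_BANDWIDTH = 3.0e12   # 3 TB/s of off-chip memory bandwidth

HIDDEN = 8192            # hypothetical square weight matrix, 2-byte elements
for batch in (1, 16, 512):
    ai = gemm_arithmetic_intensity(batch, HIDDEN, HIDDEN, 2)
    memory = is_memory_bound(batch, HIDDEN, HIDDEN, 2, PEAK_FLOPS, MEM_BANDWIDTH)
    kind = "memory" if memory else "compute"
    print(f"batch={batch:4d}  intensity={ai:7.1f} FLOP/byte  -> {kind}-bound")
```

With these placeholder numbers, small batches sit far below the hardware ratio (memory-bound) and only a large batch crosses over to compute-bound, which matches the "until the batch size gets quite large" remark.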

It'll help, but GPU crunch isn't caused by people running 6-8bit inference on a single card, but by all the large scale pre-training + fine-tuning runs.
Can you link to an actual performance analysis on this?
Easy. I ran tests on a desktop Core i7-7700 with 64 GB of DDR4-2400. I tested 13B, 30B, and 70B models on it, and as you can imagine, it is easy to control how many CPU cores are used.

The answer is: it really works, but slowly (about 0.5 to 1 tokens per second, with near 100% CPU usage).

The i7-7700 is a well-balanced machine, but I have hit memory bandwidth limits on it a few times before with highly optimized software, and that looks very different: when using all cores, I got only about 50% CPU usage.

BTW, llama.cpp is very good software.
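
One way to sanity-check whether a run like this is limited by the CPU or by memory is a rough bandwidth ceiling: if every weight has to be streamed from RAM once per generated token, tokens per second cannot exceed bandwidth divided by model size. The sketch below assumes dual-channel DDR4-2400 at its theoretical peak and 4-bit quantized weights; neither is stated in the comment, so treat both as assumptions.

```python
def max_tokens_per_second(n_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if all weights are read from RAM once per token."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Assumption: 2 channels x 8 bytes per transfer x 2400 MT/s (theoretical peak).
DDR4_2400_DUAL_CHANNEL = 2 * 8 * 2400e6
# Assumption: weights quantized to 4 bits, i.e. 0.5 bytes per parameter.
BYTES_PER_PARAM = 0.5

for billions in (13, 30, 70):
    ceiling = max_tokens_per_second(billions * 1e9, BYTES_PER_PARAM, DDR4_2400_DUAL_CHANNEL)
    print(f"{billions}B model: at most ~{ceiling:.1f} tokens/s from memory bandwidth alone")
```

Under these assumptions the ceiling for a 70B model is close to 1 token per second, so measured throughput near that ceiling would point at memory bandwidth, while throughput well below it would point at compute.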

If I’m not mistaken, for parallel inference requests and for prompt preprocessing it’s compute bound.

Also, if you have just a single model you want to optimise (and not the training), you could build an array of asics that do specific matrix computations - then you don’t need to read weights from memory at all.

> Perhaps he will succeed in misallocating a trillion dollars along the way.

Must be really hard, being only a half-billionaire and trying to keep up with Elon's "success"...

It is always a weird take, it happens with Elon Musk all of the time too. Clearly, some people believe they should both be consulting hacker news before making any decisions, because we know better.