| HN Mirror



	by MaximilianEmel 859 days ago
	Could it be that today's workloads are perfect for Nvidia GPUs? Not because it is an ideal chip, but because Nvidia GPUs are so widely available that current workloads are written to take advantage of their architecture.

3 comments

alecco 858 days ago

Most workloads have not yet caught up with Nvidia Hopper optimizations. The key is the Tensor Cores.

Google came up with the TPU (2015) for GEMM. Nvidia just took the idea and ran with it (Turing 2018). So it wasn't that Nvidia had a head start on this.

Now Nvidia Hopper is far ahead of everybody else. It has things like async memory management for the Tensor Cores (Tensor Memory Accelerator), mixed precision, and even FP8 support.

Most of the software out there has not yet caught up with that. Even Nvidia's own Transformer Engine software is not making the best use of it (Microsoft Research, October 2023: backward pass and cross-device communication).

Last year FlashAttention was a game changer for performance thanks to its memory-access optimizations. Before that, nobody was properly optimizing Transformer models for Nvidia hardware.
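
To illustrate the kind of memory optimization FlashAttention is known for, here is a minimal sketch of tiled attention with an online softmax, for a single query vector in plain Python. It is not the FlashAttention kernel itself; the names `attention_reference`, `attention_tiled`, and the `tile` parameter are hypothetical and chosen only for this example. The point is that keys and values can be consumed one small tile at a time, without ever materializing the full score vector, and still give the same result.

```python
import math

def attention_reference(q, keys, values):
    """Plain softmax attention: needs every score in memory at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    """Same result, but keys/values are processed tile by tile."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim             # unnormalized weighted sum of values
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

# Made-up example data, only to compare the two functions.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
```

On a GPU the tile would live in fast on-chip memory, so the large intermediate never has to round-trip through slower device memory; this sketch only shows the arithmetic that makes that possible.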


cma 858 days ago

Systolic arrays for matrix multiplication go back farther than TPU.


panarky 859 days ago

The scale of this should tell us it's not just about building an alternative to Nvidia.

$7 trillion is like adding TSMC, Intel and AMD together, and multiplying that combination by seven.

This is about sheer capacity, not just circumventing CUDA.


polishdude20 858 days ago

Why not just give a fraction of that to Nvidia and tell them "make us more please, we will buy in bulk"?


WanderPanda 859 days ago

What they are highly optimized for is mixed-precision GEMM (like the chips of all other accelerator manufacturers). What distinguishes Nvidia for now (imo) is that CUDA cores are also quite good at normal code (with control flow etc). I used to think that being close to optimal at one would rule out being close to optimal at the other, but it turns out they share a lot of resources (SRAM), and the overhead in chip area when one or the other is lying dormant seems negligible. I'm pretty sure that AMD et al will be sufficiently successful at blatantly copying the CUDA API that we will see serious competition in the next few years. The bigger source of uncertainty might actually be fabbing capacity.

I find it hard to argue that this moat supports a $1.7T valuation. I find it hard to believe that, for a couple of billion dollars plus TSMC credits, no one would be able to recreate the CUDA ecosystem and hardware in the medium term.
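
For readers unfamiliar with the term, here is a minimal sketch of what "mixed-precision GEMM" means: low-precision (FP16) inputs multiplied together, with the running sum kept in a wider format (FP32). This is a plain-Python emulation using `struct` rounding, not how any real accelerator is programmed; the helper names `to_fp16`, `to_fp32`, and `gemm` are hypothetical and exist only for this example.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def gemm(a, b, round_acc):
    """C = A @ B with FP16 inputs.

    round_acc is applied to the accumulator after every addition, so it
    decides the accumulator precision (assumption: one rounding per add).
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                prod = to_fp16(a[i][p]) * to_fp16(b[p][j])
                acc = round_acc(acc + prod)
            c[i][j] = acc
    return c

# Made-up example: a 1 x K row of ones times a K x 1 column of ones.
K = 4096
a = [[1.0] * K]
b = [[1.0] for _ in range(K)]

mixed = gemm(a, b, to_fp32)[0][0]  # FP16 inputs, FP32 accumulator
half = gemm(a, b, to_fp16)[0][0]   # FP16 inputs, FP16 accumulator
```

With these inputs the FP32 accumulator reaches the exact answer, 4096.0, while the FP16 accumulator stalls at 2048.0, because FP16 values above 2048 are spaced 2 apart and adding 1 rounds back down. That gap is the reason accelerators pair cheap low-precision multiplies with a wider accumulator.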
