
by yarri 766 days ago

Details from their technical memo at https://www.etched.com/announcing-etched

## How can we fit so much more compute on the silicon?

The NVIDIA H200 has 989 TFLOPS of FP16/BF16 compute without sparsity. This is state-of-the-art (more than even Google’s new Trillium chip), and the GB200 launching in 2025 has only 25% more compute (1,250 TFLOPS per die).

Since the vast majority of a GPU’s area is devoted to programmability, specializing on transformers lets you fit far more compute. You can prove this to yourself from first principles:

It takes 10,000 transistors to build a single FP16/BF16/FP8 multiply-add circuit, the building block for all matrix math. The H100 SXM has 528 tensor cores, and each has $4 \times 8 \times 16$ FMA circuits. Multiplying tells us the H100 has 2.7 billion transistors dedicated to tensor cores.

*But an H100 has 80 billion transistors! This means only 3.3% of the transistors on an H100 GPU are used for matrix multiplication!*
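The arithmetic above can be reproduced in a few lines. This is only a sketch that plugs in the memo's own figures (10,000 transistors per FMA, 528 tensor cores, 80 billion transistors); the constant and function names are made up for illustration. The unrounded share comes out at roughly 3.4%, so the memo's 3.3% looks like it was rounded down.

```python
# Back-of-the-envelope check of the tensor-core transistor share.
# Every input is a figure quoted in the memo, not a measured value.
TRANSISTORS_PER_FMA = 10_000        # memo's estimate for one multiply-add circuit
TENSOR_CORES = 528                  # H100 SXM, per the memo
FMAS_PER_TENSOR_CORE = 4 * 8 * 16   # 512 FMA circuits per tensor core
TOTAL_TRANSISTORS = 80e9            # whole H100 die, per the memo


def tensor_core_transistors() -> int:
    """Transistors spent on FMA circuits inside tensor cores."""
    return TENSOR_CORES * FMAS_PER_TENSOR_CORE * TRANSISTORS_PER_FMA


def tensor_core_share() -> float:
    """Fraction of all transistors that sit in tensor-core FMA circuits."""
    return tensor_core_transistors() / TOTAL_TRANSISTORS


if __name__ == "__main__":
    print(f"{tensor_core_transistors() / 1e9:.2f} billion transistors in tensor cores")
    print(f"{tensor_core_share():.1%} of the die")
```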

This is a deliberate design decision by NVIDIA and other flexible AI chips. If you want to support all kinds of models (CNNs, LSTMs, SSMs, and others), you can’t do much better than this.

By only running transformers, we can fit way more FLOPS on our chip, without resorting to lower precisions or sparsity.

## Isn’t memory bandwidth the bottleneck on inference?

For modern models like Llama-3, no!


torginus 766 days ago

Honestly all this math sounds a bit fishy to me. An H200 has about 5 TB/s of memory bandwidth. If we assume a pure matrix multiply workload, we need to fetch 2 FP16 values (4 bytes) per multiply-add, which means we are capped at 1.25 TFLOPs. Even in the best-case scenario, where one of the operands is cached and the other is an FP8, we are only at 5 TFLOPs, which is way less than what the H200 can do.

I don't get how throwing more ALUs at the problem would make things better, it's very much bandwidth constrained.

That's why Groq exists which has a ton of SRAM on chip.
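For reference, the cap estimated above can be written out as a small sketch. It bakes in the same assumptions as the estimate: every multiply-add streams its operands from memory with no reuse, and one multiply-add is counted as one op. The function name and constant are hypothetical.

```python
def streaming_cap_ops(bandwidth_bytes_per_s: float, bytes_per_fma: float) -> float:
    """Multiply-adds per second if each one must fetch bytes_per_fma from memory.

    Assumes zero operand reuse, which is the premise of the estimate above.
    """
    return bandwidth_bytes_per_s / bytes_per_fma


H200_BANDWIDTH = 5e12  # ~5 TB/s, the round figure used in the comment

# Two FP16 operands fetched per multiply-add: 2 values * 2 bytes each.
two_fp16 = streaming_cap_ops(H200_BANDWIDTH, 2 * 2)

# Best case from the comment: one operand cached, the other a 1-byte FP8 value.
one_fp8 = streaming_cap_ops(H200_BANDWIDTH, 1)

if __name__ == "__main__":
    print(f"two FP16 operands: {two_fp16 / 1e12:.2f} T multiply-adds/s")
    print(f"one FP8 operand:   {one_fp8 / 1e12:.2f} T multiply-adds/s")
```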


aifath 766 days ago

Matmul is cubic compute, but quadratic memory.

For [M, K] @ [K, N], the read is O(MK + NK) and the compute is O(MNK). A quick estimate for compute/bandwidth is min(M, N, K). M is the batch size, so they can just blow that up to get nice-looking numbers. On Llama 70B, min(N, K) is 3584 and 7168 for matmuls 1 and 2.

Groq needs a ton of SRAM because they optimized for batch size 1 latency, so M is very small.
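A rough sketch of that estimate, assuming each matrix moves to or from memory exactly once and 2-byte elements. With those assumptions the FLOPs-per-byte ratio works out to 1 / (1/M + 1/N + 1/K), which is always at most min(M, N, K). The shapes reuse the 3584 and 7168 figures above, but which one is N and which is K is a guess, and the peak and bandwidth numbers are the ones quoted elsewhere in the thread. All names are made up for illustration.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for [m, k] @ [k, n], each matrix touched once in memory."""
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Intensity above which a kernel is compute-bound instead of bandwidth-bound."""
    return peak_flops / bandwidth_bytes_per_s


# Figures quoted in the thread: 989 TFLOPS of FP16 compute, ~5 TB/s of bandwidth.
RIDGE = ridge_point(989e12, 5e12)

if __name__ == "__main__":
    for batch in (1, 64, 1024):
        intensity = matmul_intensity(batch, n=7168, k=3584)
        bound = "compute" if intensity > RIDGE else "bandwidth"
        print(f"M={batch:5d}: {intensity:8.1f} FLOPs/byte -> {bound}-bound")
```

At batch size 1 the ratio is just under 1 FLOP per byte, so the matmul is heavily bandwidth-bound; at a batch size in the low thousands it climbs past the ridge point, which is the effect of blowing up M.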


boznz 765 days ago

nothing about power consumption


yarri 765 days ago

This is a datacenter chip. HVAC requirements are more interesting IMO; they seem to be targeting air-cooled edge deployments with that card. They’ll probably wind up with a baseboard design similar to the early TPU v4i.

https://ieeexplore.ieee.org/document/9499913
