
	by xadhominemx 258 days ago
	Cerebras hasn’t made any technical breakthroughs; they are just putting everything in SRAM. It’s a brute-force approach that gets very high inference throughput but comes at an extremely high cost per token per second and is not useful for batched inference. Groq uses the same approach. Memory hierarchy management across HBM/DDR/Flash is much more difficult but necessary to achieve practical inference economics.

twothreeone 258 days ago

I don't think you realize the history of wafer-scale integration and what it means for the chip industry [1]. The approach was famously taken by Gene Amdahl's Trilogy Systems in the '80s but failed dramatically, leading to (among other things) the deployment of "accelerator cards" in the form of... the NVIDIA GeForce 256, the first GPU, in 1999. It's not like NVIDIA hasn't been trying to integrate multiple dies in the same package, but doing that successfully has been a huge technological hurdle so far.

[1] https://ieeexplore.ieee.org/abstract/document/9623424

averne_ 258 days ago

The main reason a wafer-scale chip works there is that their cores are extremely tiny, so the silicon area that gets fused off in the event of a defect is much smaller than on NVIDIA chips, where a whole SM can get disabled. AFAIU this approach is not easily applicable to complex core designs.
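
To put rough numbers on the defect argument, here is a minimal sketch assuming a simple Poisson defect model, where each defect disables exactly one whole unit (a tiny core or a large SM-like block). The defect density and unit areas are hypothetical values picked for illustration, not figures from Cerebras or NVIDIA, and `fraction_disabled` is a made-up helper name.

```python
import math


def fraction_disabled(defect_density_per_mm2: float, unit_area_mm2: float) -> float:
    """Expected fraction of silicon fused off when a defect disables a whole unit.

    Assumes a Poisson defect model: a unit of area A is defect-free with
    probability exp(-D * A), so the expected fraction of units (and hence
    of area) that gets disabled is 1 - exp(-D * A).
    """
    return 1.0 - math.exp(-defect_density_per_mm2 * unit_area_mm2)


# Hypothetical numbers, chosen only to show the trend.
D = 0.001          # defects per mm^2 (assumed)
tiny_core = 0.05   # mm^2 per tiny core (assumed)
large_block = 5.0  # mm^2 per large SM-like block (assumed)

tiny_loss = fraction_disabled(D, tiny_core)
large_loss = fraction_disabled(D, large_block)

print(f"tiny cores:   {tiny_loss:.4%} of area disabled")
print(f"large blocks: {large_loss:.4%} of area disabled")
```

Under this model the lost area scales roughly linearly with the unit size while D * A is small, which is why shrinking the fusible unit helps so much. Real yield models also account for defect clustering and spare-unit routing, which this sketch ignores.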

xadhominemx 258 days ago

I understand that topic well. They stitched top metal layers across the reticle - not that challenging, and the foundational IP is not their own.

Everyone else went in the CoWoS direction, which enables heterogeneous integration and much more cost-effective inference.