by xadhominemx 258 days ago
Cerebras hasn’t made any technical breakthroughs; they are just putting everything in SRAM. It’s a brute-force approach that gets very high inference throughput, but it comes at an extremely high cost per token per second and is not useful for batched inference. Groq uses the same approach.

Memory hierarchy management across HBM/DDR/Flash is much more difficult but necessary to achieve practical inference economics.

1 comments

I don't think you realize the history of wafer-scale integration and what it means for the chip industry [1]. The approach was famously taken by Gene Amdahl's Trilogy Systems in the 1980s, but it failed dramatically, leading (among other things) to the deployment of "accelerator cards" in the form of... the NVIDIA GeForce 256, the first GPU, in 1999. It's not like NVIDIA hasn't been trying to integrate multiple dies in the same package, but doing that successfully has been a huge technological hurdle so far.

[1] https://ieeexplore.ieee.org/abstract/document/9623424

The main reason a wafer-scale chip works there is that their cores are extremely tiny, so the silicon area that gets fused off in the event of a defect is much smaller than on NVIDIA chips, where a whole SM can get disabled. AFAIU this approach is not easily applicable to complex core designs.
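
To put rough numbers on the defect argument, here's a minimal sketch using a simple Poisson defect model: a redundant unit of area A is hit by at least one defect with probability 1 - exp(-D * A), and a hit unit is fused off entirely. The defect density and both unit areas are made-up illustrative values, not Cerebras or NVIDIA figures, and `fraction_disabled` is a hypothetical helper name.

```python
import math


def fraction_disabled(defect_density_per_cm2: float, unit_area_mm2: float) -> float:
    """Expected fraction of redundant units (and so of silicon) fused off.

    Poisson model: a unit of area A sees at least one defect with
    probability 1 - exp(-D * A), and any hit unit is disabled whole.
    """
    unit_area_cm2 = unit_area_mm2 / 100.0  # 1 cm^2 = 100 mm^2
    return 1.0 - math.exp(-defect_density_per_cm2 * unit_area_cm2)


# Illustrative assumptions only, not vendor data.
DEFECT_DENSITY = 0.1   # defects per cm^2 (assumed)
TINY_CORE_MM2 = 0.05   # area of one small core (assumed)
LARGE_UNIT_MM2 = 5.0   # area of one large SM-like unit (assumed)

tiny_loss = fraction_disabled(DEFECT_DENSITY, TINY_CORE_MM2)
large_loss = fraction_disabled(DEFECT_DENSITY, LARGE_UNIT_MM2)

print(f"tiny cores:  {tiny_loss:.4%} of silicon fused off")
print(f"large units: {large_loss:.4%} of silicon fused off")
```

When D * A is small, 1 - exp(-D * A) is approximately D * A, so under this model the silicon lost scales roughly linearly with the size of the unit you have to disable. It ignores clustered defects and the routing overhead needed to steer around dead cores.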

I understand that topic well. They stitched the top metal layers across reticle boundaries, which is not that challenging, and the foundational IP is not their own.

Everyone else went the CoWoS direction, which enables heterogeneous integration and much more cost-effective inference.