by xadhominemx
258 days ago
Cerebras hasn’t made any technical breakthroughs; they are just putting everything in SRAM. It’s a brute-force approach that gets very high inference throughput, but it comes at an extremely high cost per token per second and is not useful for batched inference. Groq uses the same approach. Managing a memory hierarchy across HBM/DDR/flash is much more difficult, but it is necessary to achieve practical inference economics.
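To put rough numbers on the SRAM capacity point, here is a back-of-envelope sketch of how many chips it takes just to hold a model's weights in on-chip or on-package memory. Everything in it is an assumption for illustration: the `chips_needed` helper is hypothetical, the model is a generic 70B-parameter model at 2 bytes per parameter, and the per-chip capacities are approximate public figures (about 44 GB of SRAM for a Cerebras WSE-3, about 230 MB of SRAM for a Groq LPU, 80 GB of HBM for an H100). It ignores KV cache, activations, and any replication, so real deployments would need more.

```python
import math

def chips_needed(params_billion: float, bytes_per_param: float, mem_gb_per_chip: float) -> int:
    """Minimum chip count to hold the weights alone (hypothetical helper).

    params_billion * bytes_per_param gives the weight footprint in GB
    (1e9 params * bytes / 1e9 bytes-per-GB). KV cache and activations
    are deliberately ignored, so this is a lower bound.
    """
    weights_gb = params_billion * bytes_per_param
    return math.ceil(weights_gb / mem_gb_per_chip)

# Assumed, approximate per-chip memory capacities in GB.
ASSUMED_CAPACITY_GB = {
    "Cerebras WSE-3 (SRAM)": 44.0,
    "Groq LPU (SRAM)": 0.23,
    "H100 (HBM)": 80.0,
}

if __name__ == "__main__":
    # Assumed workload: 70B parameters stored at 2 bytes each (FP16/BF16).
    for name, capacity_gb in ASSUMED_CAPACITY_GB.items():
        print(f"{name}: at least {chips_needed(70, 2, capacity_gb)} chips")
```

Under those assumptions the weights alone need several wafers or hundreds of small SRAM-only chips, versus a couple of HBM parts. That gap in hardware per model is the cost argument; it says nothing about latency, where the SRAM designs are the ones that win.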
[1] https://ieeexplore.ieee.org/abstract/document/9623424