by rajbiswas125 122 days ago

The numbers being presented are deliberately misleading. On this model, Groq delivers around 1,300 tokens per second, whereas Cerebras achieves roughly 2,500 tokens per second.

With the next generation of Cerebras chips expected to be 5–7× faster, peak throughput could reach roughly 12,500 to 17,500 tokens per second (2,500 × 5 at the low end, 2,500 × 7 at the high end). For smaller models like this, that level of performance is entirely realistic. So no, a general-purpose accelerator will likely continue to outperform a fixed-function ASIC with a specific model etched into it.
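
For anyone who wants to check the arithmetic, here is a rough sketch of the projection. The throughput figures and the 5–7× multiplier are the ones quoted in this comment; the multiplier is an expectation, not a measurement, and the function name is just illustrative.

```python
# Throughput figures quoted in this comment (tokens per second on this model).
GROQ_TPS = 1_300
CEREBRAS_TPS = 2_500

# Expected speedup range for next-generation Cerebras chips.
# This is an expectation stated in the comment, not a benchmark result.
SPEEDUP_LOW, SPEEDUP_HIGH = 5, 7


def projected_range(base_tps, low, high):
    """Scale a baseline throughput by a low and a high speedup factor."""
    return base_tps * low, base_tps * high


low_tps, high_tps = projected_range(CEREBRAS_TPS, SPEEDUP_LOW, SPEEDUP_HIGH)
ratio_today = CEREBRAS_TPS / GROQ_TPS

print(f"Projected next-gen range: {low_tps:,} to {high_tps:,} tokens/s")
print(f"Cerebras vs Groq today: {ratio_today:.2f}x")
```

The ~17,500 figure is the top of that range; the bottom is 12,500. Either end is a straight linear scaling of today's number and assumes the speedup carries over fully to this model.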

Moreover, we’re only looking at results from a two-year-old, relatively small model. We still don’t know how this architecture will scale with a large MoE model, especially given constraints like limited on-chip KV cache and more complex attention mechanisms.
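
To make the KV cache concern concrete, here is a back-of-the-envelope sketch of how cache size grows with context length. It assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. Every model parameter below is hypothetical and chosen only for illustration; none of them describe the chip or the model under discussion.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size for a standard attention stack.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"KV cache per sequence: {size / 2**30:.2f} GiB")

# The cache grows linearly with context length and batch size.
doubled = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=16384)
print(f"At twice the context: {doubled / 2**30:.2f} GiB")
```

The point is only the scaling: the cache grows linearly with layers, context length, and concurrent sequences, so a fixed on-chip memory budget gets tight quickly as models and contexts grow.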

The real test isn’t performance on a small benchmark model; it’s how the system handles large-scale, production-grade workloads under architectural constraints.