Cerebras has been a true revelation when it comes to inference. I have a lot of respect for their founder, team, innovation, and technology. The colossal size of the WSE-3 chip and its use of on-chip SRAM at a mind-boggling scale are definitely ultra cool. I also wonder why they haven't been acquired yet. Or is that intentional? That said, their pricing and deployment strategy is murky. Paying $1,500-$10,000 per month plus usage costs? I assume it's about chasing higher-value contracts and deeper-pocketed customers, hence the minimum monthly spend they require. I'm not claiming to be an expert, but as a CEO/CTO I found other providers in the market with relatively comparable inference speed (Cerebras is obviously #1), easier onboarding, and more responsive staff (every interaction I've had with Cerebras was answered days or weeks late, or simply ignored). IMHO, if Cerebras wants to gain more mindshare, they'll have to fix this.
1. To achieve high speeds, they put everything in SRAM. I estimated that they'd need over $100m of chips just to serve Qwen 3 at max context size. You can run the same model at max context size on $1m of Blackwell chips, just at a slower speed. Anandtech had an article saying Cerebras was selling a single chip for around $2-3m. https://news.ycombinator.com/item?id=44658198
2. SRAM density has virtually stopped scaling in new nodes, so new generations of wafer-scale chips won’t gain as much as traditional GPUs.
3. Cerebras was designed in the pre-ChatGPT era, when much smaller models were being trained. It is practically useless for training in 2025 because of how big LLMs have gotten. That leaves inference, which runs into the two problems above.
4. To serve very large LLMs economically, Cerebras would need to use external HBM. If it has to reach outside the wafer for memory, the benefits of a wafer-scale chip greatly diminish. Remember that the whole idea was to put the entire AI model on the wafer so that memory bandwidth is ultra fast.
5. Chip interconnect technology might make wafer-scale chips increasingly redundant. TSMC has a roadmap for gluing more than 2 GPU dies together, and Nvidia’s Feynman GPUs might have 4 dies glued together. I.e., the sweet spot for large chips might not be wafer scale but 2, 4, or 8 GPU dies packaged together.
6. Nvidia seems to be moving much faster in development and in responding to market needs. For example, Blackwell is now focused on FP4 inference. I suppose designing and building a wafer-scale chip is more complex than designing a GPU. Cerebras also needs to wait for new nodes to fully mature so that yields are high enough.
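A rough back-of-envelope sketch of the sizing argument in point 1 and the bandwidth argument in point 4. None of these numbers come from the comment except the $2-3m chip price: the ~44 GB of on-chip SRAM per WSE-3 wafer and ~8 TB/s of HBM bandwidth per Blackwell GPU are vendor-published figures I'm assuming, and the model sizes in the usage below assume the commenter means Qwen3-235B-A22B (235B total, ~22B active parameters). The KV cache size is left as a free parameter because it depends on context length and batch size.

```python
import math

# Assumptions (not from the comment): ~44 GB SRAM per WSE-3 wafer, ~8 TB/s HBM per
# Blackwell GPU. Price is the midpoint of the $2-3m per-chip figure cited above.
SRAM_PER_WAFER_GB = 44
PRICE_PER_WAFER_USD = 2.5e6
GPU_HBM_BW_TBPS = 8


def wafers_needed(params_billion, bytes_per_param, kv_cache_gb=0.0):
    """Wafers required to keep weights plus KV cache entirely in on-chip SRAM."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes/param = GB
    return math.ceil((weights_gb + kv_cache_gb) / SRAM_PER_WAFER_GB)


def cluster_cost_usd(n_wafers):
    """Chip cost only; ignores systems, networking, power, and cooling."""
    return n_wafers * PRICE_PER_WAFER_USD


def decode_tokens_per_s_bound(active_params_billion, bytes_per_param, bandwidth_tbps):
    """Idealized ceiling on single-stream decode speed when every generated token
    must stream all active weights from memory once (memory-bandwidth-bound)."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_tbps * 1e12 / bytes_per_token


if __name__ == "__main__":
    n = wafers_needed(235, 2)  # 16-bit weights only, no KV cache
    print(n, cluster_cost_usd(n))
    print(decode_tokens_per_s_bound(22, 2, GPU_HBM_BW_TBPS))
```

Under those assumptions, 16-bit weights alone need 11 wafers (about $27.5m); the KV cache at max context is what pushes the total further, and by how much depends entirely on the `kv_cache_gb` you plug in. The HBM case comes out around 182 tokens/s per stream, an idealized ceiling that real systems don't reach, but it shows why streaming weights from external memory would pull a wafer back toward GPU-like single-stream speeds.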
There is a niche where some applications need super-fast token generation regardless of price; hedge funds and Wall Street might be good use cases. But it won’t challenge Nvidia in training or large-scale inference.