|
|
|
|
|
by latchkey
433 days ago
|
|
Cerebras (and Groq) has the problem of using too much die for compute and not enough for memory. Their method of scaling is to fan out the compute across more physical space. This takes more dc space, power and cooling, which is a huge issue. Funny enough, when I talked to Cerebras at SC24, they told me their largest customers are for training, not inference. They just market it as an inference product, which is even more confusing to me. I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line. |
|
From die shots and materials I’ve seen, it even looks like ~40% of the die might be allocated to memory [1]. Given that, I’m curious about your point on “not enough die for memory” — is it a matter of absolute capacity still being insufficient for current model sizes, or more about the area-bandwidth tradeoff being unbalanced for inference workloads? Or perhaps something else entirely?
I’d love to understand this design tension more deeply, especially from someone with a high-level view of real-world deployments. Thanks again.
[1] Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads — Fig. 5. Die photo of 14nm ASIC implementation of the Groq TSP. https://groq.com/wp-content/uploads/2024/02/2020-Isca.pdf