| HN Mirror



	by latchkey 433 days ago
	Cerebras (and Groq) have the problem of using too much die for compute and not enough for memory. Their method of scaling is to fan out the compute across more physical space. This takes more DC space, power and cooling, which is a huge issue. Funny enough, when I talked to Cerebras at SC24, they told me their largest customers are for training, not inference. They just market it as an inference product, which is even more confusing to me. I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line.

2 comments

usatie 433 days ago

Thank you for sharing this perspective — really insightful. I’ve been reading up on Groq’s architecture and was under the impression that their chips dedicate a significant portion of die area to on-chip SRAM (around 220MiB per chip, if I recall correctly), which struck me as quite generous compared to typical accelerators.

From die shots and materials I’ve seen, it even looks like ~40% of the die might be allocated to memory [1]. Given that, I’m curious about your point on “not enough die for memory” — is it a matter of absolute capacity still being insufficient for current model sizes, or more about the area-bandwidth tradeoff being unbalanced for inference workloads? Or perhaps something else entirely?

I’d love to understand this design tension more deeply, especially from someone with a high-level view of real-world deployments. Thanks again.

[1] Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads — Fig. 5. Die photo of 14nm ASIC implementation of the Groq TSP. https://groq.com/wp-content/uploads/2024/02/2020-Isca.pdf


latchkey 433 days ago

> is it a matter of absolute capacity still being insufficient for current model sizes

This. Additionally, models aren't getting smaller; they are getting bigger. To be useful to a wider range of users, they also need more context to go off of, which takes even more memory.

Previously: https://news.ycombinator.com/item?id=42003823
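To put rough numbers on the capacity point, here is a back-of-the-envelope sketch. It takes the ~220 MiB of SRAM per chip quoted upthread as given; the 70B-parameter model and the 1 or 2 bytes per parameter are illustrative assumptions, not figures from this thread. It counts weights only, so KV cache for long contexts would push the number higher.

```python
import math

MIB = 1024 ** 2  # bytes in one MiB


def chips_needed(params_billion, bytes_per_param, sram_mib_per_chip=220):
    """Chips required to hold the model weights alone in on-chip SRAM.

    sram_mib_per_chip defaults to the ~220 MiB figure quoted upthread.
    The model size and precision are hypothetical inputs.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return math.ceil(weight_bytes / (sram_mib_per_chip * MIB))


if __name__ == "__main__":
    # Hypothetical 70B-parameter model at 8-bit and 16-bit weights.
    for bytes_per_param in (1, 2):
        n = chips_needed(70, bytes_per_param)
        print(f"{bytes_per_param} byte(s)/param: {n} chips")
```

Under these assumptions the weights alone spread across a few hundred chips, which is the fan-out across physical space described above.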

It could be partially the DC, but look at the rack density... to get to an equal amount of GPU compute and memory, you need 10x the rack space...

https://www.linkedin.com/posts/andrewdfeldman_a-few-weeks-ag...

Previously: https://news.ycombinator.com/item?id=39966620

Now compare that to an NVL72 and the direction Dell/CoreWeave/Switch are going in with the EVO containment... far better. One can imagine that AMD might do something similar.

https://www.coreweave.com/blog/coreweave-pushes-boundaries-w...


usatie 431 days ago

Thanks for the links — I went through all of them (took me a while). The point about rack density differences between SRAM-based systems like Cerebras or Groq and GPU clusters is now clear to me.

What I’m still trying to understand is the economics.

From this benchmark: https://artificialanalysis.ai/models/llama-4-scout/providers...

Groq seems to offer nearly the lowest price per million tokens and nearly the fastest end-to-end response times. That’s surprising because, in my understanding, speed (latency) and cost are trade-offs.

So I’m wondering: why can’t GPU-based providers offer cheaper but slower (higher-latency) APIs? Or do you think Groq/Cerebras are pricing well below cost (loss-leader style)?


latchkey 431 days ago

Loss leader. It is uber/airbnb. Book revenue, regardless of economics, and then debt finance against that. Hope one day to lock in customers, or raise prices, or sell the company.


heymijo 433 days ago

> they told me their largest customers are for training, not inference

That is curious. Things are moving so quickly right now. I typed out a few speculative sentences then went ahead and asked an LLM.

Looks like Cerebras is responding to the market and pivoting towards a perceived strength of their product combined with the growth in inference, especially with the advent of reasoning models.


latchkey 433 days ago

I wouldn't call it "pivoting" as much as "marketing".
