by usatie
433 days ago
Thank you for sharing this perspective; it’s really insightful. I’ve been reading up on Groq’s architecture and was under the impression that their chips dedicate a significant portion of die area to on-chip SRAM (around 220 MiB per chip, if I recall correctly), which struck me as quite generous compared to typical accelerators. From die shots and materials I’ve seen, it even looks like ~40% of the die might be allocated to memory [1]. Given that, I’m curious about your point on “not enough die for memory”: is it that absolute capacity is still insufficient for current model sizes, or that the area-bandwidth tradeoff is unbalanced for inference workloads? Or perhaps something else entirely? I’d love to understand this design tension more deeply, especially from someone with a high-level view of real-world deployments. Thanks again.

[1] Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads, Fig. 5: Die photo of 14nm ASIC implementation of the Groq TSP. https://groq.com/wp-content/uploads/2024/02/2020-Isca.pdf
This. Additionally, models aren't getting smaller; they're getting bigger. And to be useful to a wider range of users, they also need more context to work from, which means even more memory.
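To put rough numbers on the capacity side of this, here is a back-of-envelope sketch in Python. It assumes the ~220 MiB per-chip SRAM figure quoted above, a hypothetical 70B-parameter model stored at 8 bits per weight, and a hypothetical 70B-class attention config (80 layers, 8 KV heads, head dimension 128, 16-bit KV cache) chosen only for illustration. It counts raw capacity and ignores activations, replication, and interconnect, so treat the results as a lower bound rather than a deployment figure.

```python
import math

MIB = 1024 ** 2
# Assumption: ~220 MiB of on-chip SRAM per chip, as quoted in the question above.
SRAM_PER_CHIP_BYTES = 220 * MIB


def chips_needed(total_bytes: int, per_chip_bytes: int = SRAM_PER_CHIP_BYTES) -> int:
    """Minimum number of chips whose combined SRAM could hold total_bytes (capacity only)."""
    return math.ceil(total_bytes / per_chip_bytes)


def weight_bytes(n_params: int, bytes_per_param: int) -> int:
    """Bytes needed to store the model weights at a given precision."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch: int, bytes_per_elem: int) -> int:
    """KV cache size: keys and values (factor 2) per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical 70B model at 8-bit weights.
    w = weight_bytes(70_000_000_000, 1)
    print("weights (8-bit):", chips_needed(w), "chips")

    # Hypothetical 70B-class config, 16-bit KV cache, one sequence.
    for ctx in (8192, 32768):
        kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                            context_len=ctx, batch=1, bytes_per_elem=2)
        print(f"KV cache @ {ctx} tokens:", chips_needed(kv), "chips")
```

Under these assumptions the weights alone need 304 chips, and the KV cache for a single sequence adds about 12 more at 8K tokens and about 47 at 32K. That is the “more context means even more memory” point expressed in chip counts.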
Previously: https://news.ycombinator.com/item?id=42003823
It could partially be the DC, but look at the rack density: to match an equivalent amount of GPU compute and memory, you need 10x the rack space.
https://www.linkedin.com/posts/andrewdfeldman_a-few-weeks-ag...
Previously: https://news.ycombinator.com/item?id=39966620
Now compare that to an NVL72 and the direction Dell/CoreWeave/Switch are heading with EVO containment: far better. One can imagine AMD doing something similar.
https://www.coreweave.com/blog/coreweave-pushes-boundaries-w...