Hacker News new | ask | show | jobs
by benchess 902 days ago
This isn't running on one chip. It's running on 128, or two racks worth of their kit. https://news.ycombinator.com/item?id=38739106

This doesn't mean much without comparing $ or watts of GPU equivalents

3 comments

This article from less than a month ago says that it is on 576 chips https://www.nextplatform.com/2023/11/27/groq-says-it-can-dep...
GPUs can't scale single user performance beyond a certain limit. You can throw 100s of GPUs at it but the latency will never be as good.
Thanks, I need to correct my earlier guess: I believe this demo is running on 9 GroqRacks (576 chips) and I think we may also have an 8 rack version in progress. I can't remember off the top of my head whether this deployment has pipelining of inferences or whether that's work in progress. We've tried a variety of different configurations to improve performance (both latency and throughput), which is possible because of the high level of flexibility and configurability of our architecture and compiler toolchain.

You're right that it is important to compare cost per token also, not just raw speed. Unfortunately I don't have those figures to hand but I think our customer offerings are price competitive with OpenAI's offerings. The biggest takeaway though is that we just don't believe GPU architectures can ever scale to the performance that we can get, at any cost.