
Thanks, I need to correct my earlier guess: I believe this demo is running on 9 GroqRacks (576 chips), and I think we may also have an 8-rack version in progress. I can't remember off the top of my head whether this deployment pipelines inferences or whether that's still a work in progress. We've tried a variety of configurations to improve performance (both latency and throughput), which is possible because our architecture and compiler toolchain are highly flexible and configurable.

You're right that it's important to compare cost per token too, not just raw speed. Unfortunately I don't have those figures to hand, but I think our customer offerings are price-competitive with OpenAI's. The biggest takeaway, though, is that we just don't believe GPU architectures can ever scale to the performance we can get, at any cost.