by huqedato 824 days ago
Can somebody explain why Groq is more performant than Microsoft's infrastructure? Is an LPU better than a TPU/GPU?
3 comments

LLM performance is about parallelism but also memory bandwidth.

Groq delivers this kind of speed by networking many, many chips together with a high-bandwidth interconnect. Each chip has only 230 MB of SRAM[0].

From the linked reference:

"In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model."

That's eight racks with ~132GB of memory for the model. A single H100 has 80GB and can serve Mixtral without issue (albeit at lower performance).
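For anyone who wants to check the ~132GB figure, here's the back-of-the-envelope math as a quick Python sketch. The rack, server, and chip counts and the 230 MB per chip come from the quote above; using decimal units (1 GB = 1000 MB) is my assumption.

```python
# Figures quoted from the linked reference.
RACKS = 8
SERVERS_PER_RACK = 9
CHIPS_PER_SERVER = 8
SRAM_PER_CHIP_MB = 230

# Assumption: decimal units, 1 GB = 1000 MB.
total_chips = RACKS * SERVERS_PER_RACK * CHIPS_PER_SERVER
total_sram_gb = total_chips * SRAM_PER_CHIP_MB / 1000

print(total_chips)    # 576
print(total_sram_gb)  # 132.48
```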

If you consider the requirements of actual real-world inference serving workloads, where you need to serve multiple models, multiple versions of models, LoRA adapters, sentence embedding models (for RAG), etc., the economics and physical footprint alone get very challenging.

It's an interesting approach and clearly very, very fast but I'm curious to see how they do in the market:

1) This analysis uses cloud GPU costs for Nvidia pricing. Cloud providers make a significant margin on their GPU instances. If you take qty 1 retail pricing for an Nvidia DGX, Lambda Hyperplane, etc. and compare it to cloud GPU pricing (inference needs to run 24x7), break-even on hardware vs cloud is less than seven months, depending on what your costs are for hosting the hardware.

2) Nvidia has incredibly high margins.

3) CUDA.
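To make the break-even reasoning in point 1 concrete, here's a rough Python sketch. The function name and every number in it are hypothetical placeholders I made up for illustration, not real quotes from Nvidia, Lambda, or any cloud provider; plug in your own prices.

```python
def break_even_months(hardware_price, cloud_cost_per_month, hosting_cost_per_month):
    """Months until buying hardware beats renting equivalent cloud capacity 24x7."""
    monthly_saving = cloud_cost_per_month - hosting_cost_per_month
    if monthly_saving <= 0:
        # Hosting costs as much as (or more than) the cloud bill: never breaks even.
        return float("inf")
    return hardware_price / monthly_saving


# Hypothetical inputs, for illustration only (not real pricing).
example_months = break_even_months(
    hardware_price=300_000,
    cloud_cost_per_month=60_000,
    hosting_cost_per_month=10_000,
)
print(example_months)  # 6.0
```

The higher your own hosting costs (power, colo, staff), the smaller the monthly saving and the longer the break-even period.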

There are some special cases where tokens per second and time to first token are incredibly important (as the article states: real-time agents, etc.), but overall I think actual real-world production deployment of Groq is a pretty challenging proposition.

[0] - https://www.semianalysis.com/p/groq-inference-tokenomics-spe...

Mistral's mixture-of-experts model (Mixtral) has far fewer parameters active during inference, and Groq has special-purpose hardware (and probably less concurrent demand).
> probably less concurrent demand

This is a significant understatement. ChatGPT has an estimated 100m monthly active users.

Groq gets featured on HN from time to time but is otherwise almost completely unknown. According to their stats they have done something like 15m requests total since launch. ChatGPT likely does this in hours (or less).

It's a totally different approach to inference.

In short:

Groq - AI chip
Microsoft etc. - Nvidia GPU