Hacker News
by sir-alien 800 days ago
The Wizard 8x22B is definitely for the high end, even the 2-bit version. I tried running it on a workstation with an RTX 3090 and performance was as bad as one word every two seconds. Probably a good candidate for a Groq accelerator.

You mean a few hundred Groq accelerators ;-) (they have 230MB of SRAM per accelerator)
The H100 has 50MB SRAM (L2 cache) and does just fine.

https://docs.nvidia.com/launchpad/ai/h100-mig/latest/h100-mi...

...and 80GB of very high speed VRAM.
Sure but the point of the comment was SRAM. There is some confusion in a subset of the ML people about hardware memories, their latencies, and bandwidths. We don’t all need to write kernels like Tri Dao to make transformers efficient on GPUs, but it would be great if more people were aware of the theoretical compute constraints of each type of model on a given hardware and then a subset of them worked towards building better pipelines.
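For anyone who wants the back-of-envelope version of those constraints: single-stream decoding is usually limited by how fast the weights can be read from memory, so a rough ceiling on tokens per second is memory bandwidth divided by the bytes read per token. Below is a minimal sketch in Python. The function name and the example numbers (about 39B active parameters per token for an 8x22B mixture-of-experts model, about 936 GB/s for RTX 3090 VRAM, about 32 GB/s for a PCIe 4.0 x16 link) are my own assumptions, not figures from this thread.

```python
def max_decode_tokens_per_s(active_params, bits_per_weight, bandwidth_gb_s):
    """Rough upper bound on single-stream decode speed when weight reads dominate.

    Ignores KV-cache traffic, compute time, and caching effects, so real
    throughput will be lower than this.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Illustrative, assumed numbers (not from the thread):
ACTIVE_PARAMS = 39e9   # active parameters per token for an 8x22B MoE
GPU_VRAM_GB_S = 936    # approximate RTX 3090 memory bandwidth
SLOW_PATH_GB_S = 32    # e.g. roughly a PCIe 4.0 x16 link

in_vram = max_decode_tokens_per_s(ACTIVE_PARAMS, 2, GPU_VRAM_GB_S)
spilled = max_decode_tokens_per_s(ACTIVE_PARAMS, 2, SLOW_PATH_GB_S)
print(f"weights read from VRAM:      ~{in_vram:.0f} tokens/s at best")
print(f"weights read over slow path: ~{spilled:.1f} tokens/s at best")
```

The point is only the ratio: once part of the model has to be read over a much slower path than VRAM, that path sets the pace, whatever the GPU could otherwise do.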
Your parent comment (by my reading) implied the H100 "does just fine" when it has 50MB SRAM.

The reason Groq needs multiple racks of chips to serve models that fit in a single H100 is that Groq chips are SRAM-only, while the H100 has 80GB of HBM VRAM bolted onto it in addition to its SRAM.
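To put a number on "multiple racks", here is a quick sketch of how many SRAM-only chips it takes just to hold the weights. The 230MB per chip is the figure quoted upthread; the 141B total parameter count for an 8x22B mixture-of-experts model and the helper name are my assumptions, and the result ignores activations, KV cache, and any replication, so treat it as a lower bound.

```python
import math


def chips_for_weights(total_params, bits_per_weight, sram_mb_per_chip):
    """Minimum number of chips needed to hold the weights alone in on-chip SRAM."""
    weight_bytes = total_params * bits_per_weight / 8
    return math.ceil(weight_bytes / (sram_mb_per_chip * 1e6))


TOTAL_PARAMS = 141e9  # assumed total parameter count for an 8x22B MoE
SRAM_MB = 230         # per-accelerator SRAM quoted earlier in the thread

for bits in (2, 4, 8, 16):
    weights_gb = TOTAL_PARAMS * bits / 8 / 1e9
    chips = chips_for_weights(TOTAL_PARAMS, bits, SRAM_MB)
    print(f"{bits:>2}-bit: {weights_gb:6.1f} GB of weights, at least {chips} chips")
```

Under these assumptions the 2-bit weights come to roughly 35GB, which fits in 80GB of HBM on one H100 but needs well over a hundred 230MB chips.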

I see. You are right. I also don’t think Groq would be friendly to the home user.