HN Mirror



	by sir-alien 800 days ago
	The Wizard 8x22B is definitely for the high end, even the 2-bit version. I attempted to run it on a workstation with an RTX 3090 and got roughly one word every 2 seconds. Probably a good candidate for a Groq accelerator.
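A back-of-envelope estimate helps explain why even the 2-bit build is hard to run on one consumer card. This is a minimal sketch that assumes roughly 141e9 total parameters for an 8x22B mixture-of-experts model, a nominal 2 bits per weight with no quantization overhead, and 24 GB of VRAM on the RTX 3090; none of these figures come from the comment itself. The helper names `weight_gb` and `offloaded_fraction` are hypothetical.

```python
# Rough weight-memory estimate for an 8x22B mixture-of-experts model.
# Assumptions (not from the thread): ~141e9 total parameters, a nominal
# bits-per-weight figure with no quantization overhead, and 24 GB of VRAM
# on an RTX 3090. KV cache and activations are ignored.

TOTAL_PARAMS = 141e9     # assumed total parameter count for 8x22B
RTX3090_VRAM_GB = 24.0   # RTX 3090 memory size in GB


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


def offloaded_fraction(params: float, bits_per_weight: float, vram_gb: float) -> float:
    """Fraction of the weights that would not fit in VRAM and must live in system RAM."""
    need = weight_gb(params, bits_per_weight)
    return max(0.0, (need - vram_gb) / need)


if __name__ == "__main__":
    for bits in (16, 8, 4, 2):
        need = weight_gb(TOTAL_PARAMS, bits)
        spill = offloaded_fraction(TOTAL_PARAMS, bits, RTX3090_VRAM_GB)
        print(f"{bits:>2}-bit: {need:6.1f} GB of weights, {spill:.0%} would not fit in VRAM")
```

Under these assumptions the 2-bit weights come to about 35 GB, so roughly a third of them would have to sit outside the GPU; whether the commenter's setup actually offloaded that way is not stated. Real 2-bit formats usually spend somewhat more than 2 bits per weight on scales and metadata, so treat the figure as a lower bound.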


dsrtslnd23 800 days ago

you mean a few hundred Groq accelerators ;-) (they have 230MB SRAM per accelerator)
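The "few hundred" figure can be sanity-checked by dividing the weight footprint by the per-chip SRAM. This sketch takes the 230 MB figure from the comment and assumes about 141e9 parameters, decimal units, and weights only (no KV cache, activations, or replication across chips). `chips_needed` is a hypothetical helper, not a Groq API.

```python
import math

# Hypothetical sizing: how many accelerators with 230 MB of on-chip SRAM
# are needed just to hold a model's weights. Weights only; no KV cache,
# activations, or duplicated layers are counted.

SRAM_PER_CHIP_MB = 230.0  # per-accelerator SRAM figure quoted in the thread


def chips_needed(params: float, bits_per_weight: float,
                 sram_mb: float = SRAM_PER_CHIP_MB) -> int:
    """Smallest whole number of chips whose combined SRAM covers the weights."""
    weight_mb = params * bits_per_weight / 8 / 1e6
    return math.ceil(weight_mb / sram_mb)


if __name__ == "__main__":
    for bits in (16, 8, 2):
        print(f"{bits:>2}-bit weights: {chips_needed(141e9, bits)} chips")
```

Under those assumptions, 8-bit weights need about 614 chips and 2-bit weights about 154, which is in line with "a few hundred".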


pama 800 days ago

The H100 has 50MB SRAM (L2 cache) and does just fine.

https://docs.nvidia.com/launchpad/ai/h100-mig/latest/h100-mi...


kkielhofner 800 days ago

...and 80GB of very high speed VRAM.


pama 800 days ago

Sure, but the point of the comment was SRAM. There is some confusion among a subset of ML people about hardware memories, their latencies, and their bandwidths. We don’t all need to write kernels like Tri Dao to make transformers efficient on GPUs, but it would be great if more people were aware of the theoretical compute constraints of each type of model on a given hardware, and if a subset of them then worked towards building better pipelines.
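A rough roofline calculation shows the kind of theoretical constraint meant here. During single-stream decoding each weight is read from memory once per token and used in about one multiply-add, so the work is bound by memory bandwidth rather than FLOPs; and because 50 MB of L2 cannot hold a multi-gigabyte model, those reads come from HBM. The sketch assumes H100 SXM spec-sheet figures (about 3.35 TB/s of HBM3 bandwidth and about 989 TFLOP/s of dense BF16 tensor throughput) and a hypothetical 30e9-parameter dense model in BF16 (60 GB of weights, small enough for 80 GB of HBM). It ignores KV-cache and activation traffic, and the function names are illustrative.

```python
# Back-of-envelope roofline for decoding. All hardware and model figures
# below are assumptions for illustration, not measurements.


def ridge_point(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_flops / bandwidth_bytes_s


def decode_intensity(bytes_per_weight: float, batch: int = 1) -> float:
    """Each weight is read once per step and used in one multiply-add
    (2 FLOPs) per sequence in the batch; activation reads are ignored."""
    return 2 * batch / bytes_per_weight


def max_tokens_per_s(params: float, bytes_per_weight: float,
                     bandwidth_bytes_s: float) -> float:
    """Bandwidth-bound ceiling: all weights are streamed from memory once per token."""
    return bandwidth_bytes_s / (params * bytes_per_weight)


if __name__ == "__main__":
    H100_BW = 3.35e12     # bytes/s, assumed H100 SXM HBM3 bandwidth
    H100_BF16 = 989e12    # FLOP/s, assumed dense BF16 tensor throughput
    PARAMS = 30e9         # hypothetical dense model size
    print(f"ridge point: {ridge_point(H100_BF16, H100_BW):.0f} FLOP/byte")
    print(f"batch-1 BF16 intensity: {decode_intensity(2.0):.0f} FLOP/byte")
    print(f"decode ceiling: {max_tokens_per_s(PARAMS, 2.0, H100_BW):.0f} tokens/s")
```

With these numbers the ridge point is about 295 FLOP/byte while batch-1 BF16 decoding sits near 1 FLOP/byte, and the bandwidth ceiling is roughly 56 tokens/s. Batching raises the intensity because each weight read is reused across every sequence in the batch, which is one reason serving stacks batch requests.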


kkielhofner 800 days ago

Your parent comment (by my reading) implied the H100 "does just fine" with only 50MB of SRAM.

The reason Groq needs multiple racks of chips to serve models that fit on a single H100 is that Groq chips are SRAM-only, while the H100 has 80GB of HBM VRAM bolted onto it in addition to its SRAM.


pama 800 days ago

I see. You are right. I also don’t think Groq would be friendly to the home user.
