by pama 800 days ago
Sure, but the point of the comment was SRAM. Some ML people are confused about hardware memories, their latencies, and their bandwidths. We don’t all need to write kernels like Tri Dao to make transformers efficient on GPUs, but it would be great if more people were aware of the theoretical compute constraints of each type of model on given hardware, and if a subset of them then worked towards building better pipelines.

Your parent comment (by my reading) implied that the H100 "does just fine" with only 50MB of SRAM.

The reason Groq needs multiple racks of chips to serve models that fit in a single H100 is that Groq chips are SRAM-only, while the H100 has 80GB of HBM VRAM bolted onto it in addition to its SRAM.
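To put rough numbers on that, here is a back-of-the-envelope sketch of how many chips it takes just to hold the weights. The model size (70B parameters at 1 byte each) is hypothetical, and the 230MB per-chip SRAM figure is an assumption for illustration, not something stated in this thread. It ignores KV cache, activations, and any replication.

```python
# Back-of-the-envelope: how many chips are needed just to hold the weights?
# All figures are illustrative assumptions, not verified vendor specs.

def chips_needed(model_bytes: int, memory_per_chip_bytes: int) -> int:
    """Minimum number of chips whose combined memory holds the weights."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-model_bytes // memory_per_chip_bytes)

GB = 10**9
MB = 10**6

model_bytes = 70 * GB       # hypothetical: 70B parameters at 1 byte each
h100_hbm = 80 * GB          # HBM capacity mentioned in the comment above
sram_only_chip = 230 * MB   # assumed on-chip SRAM of an SRAM-only chip

print(chips_needed(model_bytes, h100_hbm))        # 1
print(chips_needed(model_bytes, sram_only_chip))  # 305
```

Under those assumptions, one HBM-backed GPU holds the whole model while an SRAM-only design needs hundreds of chips, which is where the "multiple racks" comes from.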

I see. You are right. I also don’t think Groq would be friendly to the home user.