Hacker News
by buildbot 769 days ago
Groq is serving an LLM from SRAM (100s of chips' worth), so the effective bandwidth, and thus token generation speed, is an order of magnitude higher than with HBM. This would 3.5x their speed as well; it is orthogonal.
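
For a rough sense of why bandwidth sets the ceiling: in single-stream decoding, every generated token has to read the full set of weights once, so tokens/s is bounded by roughly bandwidth divided by model size. A back-of-the-envelope sketch in Python follows. The bandwidth and model-size figures are hypothetical placeholders, not measured Groq or HBM numbers, and treating the 3.5x as a clean multiplier on top is an assumption about what "orthogonal" means here.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed when each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Hypothetical numbers, chosen only to illustrate a 10x bandwidth gap.
MODEL_SIZE_GB = 140.0          # assumed model footprint
HBM_BANDWIDTH_GB_S = 2_000.0   # assumed HBM-class bandwidth
SRAM_BANDWIDTH_GB_S = 20_000.0 # assumed aggregate SRAM bandwidth

hbm_rate = decode_tokens_per_second(HBM_BANDWIDTH_GB_S, MODEL_SIZE_GB)
sram_rate = decode_tokens_per_second(SRAM_BANDWIDTH_GB_S, MODEL_SIZE_GB)

# Ratio of the two bounds: the model size cancels, leaving the bandwidth ratio.
bandwidth_speedup = sram_rate / hbm_rate

# If the 3.5x technique is independent of the memory system, the gains multiply.
combined_speedup = bandwidth_speedup * 3.5

print(f"HBM bound:  {hbm_rate:.1f} tok/s")
print(f"SRAM bound: {sram_rate:.1f} tok/s")
print(f"combined speedup over the HBM baseline: {combined_speedup:.1f}x")
```

The model size cancels out of the ratio, which is why the bandwidth gap carries over to token rate regardless of which model is being served (as long as decoding stays memory-bound).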
1 comment

I'm surprised no one has done this for a GPU cluster yet - we used to do this for RNNs on GPUs & FPGAs at Baidu:

https://proceedings.mlr.press/v48/diamos16.pdf

Or better yet - on Cerebras

Kudos to Groq for writing that kernel.