

	by buildbot 769 days ago
	Groq is serving an LLM from (100s of chips' worth of) SRAM, so the effective bandwidth, and thus the token generation speed, is an order of magnitude higher than with HBM. This would also 3.5x their speed; it is orthogonal.
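	To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes batch-1 decoding is purely memory-bound, so each generated token streams every weight once and throughput is capped at bandwidth divided by model size. The function name, the model size, and the HBM figure are hypothetical placeholders, not Groq's or anyone's measured numbers; only the "order of magnitude" ratio and the 3.5x factor come from the comment above.

```python
def decode_tokens_per_second(weight_bytes: float,
                             bandwidth_bytes_per_s: float,
                             extra_speedup: float = 1.0) -> float:
    """Upper bound on batch-1 decode speed for a memory-bound model.

    Assumes every generated token reads all weights exactly once, so
    tokens/s <= bandwidth / weight_bytes. `extra_speedup` models an
    independent (orthogonal) technique that multiplies the result.
    """
    if weight_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("weight_bytes and bandwidth must be positive")
    return bandwidth_bytes_per_s / weight_bytes * extra_speedup


# Hypothetical inputs, chosen only for illustration.
WEIGHT_BYTES = 14e9            # assumed: 7B parameters at 2 bytes each
HBM_BANDWIDTH = 2e12           # assumed: 2 TB/s of HBM bandwidth
SRAM_BANDWIDTH = 10 * HBM_BANDWIDTH  # "an order of magnitude higher"

hbm_rate = decode_tokens_per_second(WEIGHT_BYTES, HBM_BANDWIDTH)
sram_rate = decode_tokens_per_second(WEIGHT_BYTES, SRAM_BANDWIDTH)
sram_plus = decode_tokens_per_second(WEIGHT_BYTES, SRAM_BANDWIDTH, 3.5)

print(f"HBM bound:          {hbm_rate:.0f} tokens/s")
print(f"SRAM bound:         {sram_rate:.0f} tokens/s")
print(f"SRAM + 3.5x method: {sram_plus:.0f} tokens/s")
```

	Because the 3.5x factor multiplies whatever the bandwidth bound is, it stacks with the SRAM advantage rather than competing with it, which is the sense of "orthogonal" here.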

1 comment

I'm surprised no one has done this for a GPU cluster yet - we used to do this for RNNs on GPUs & FPGAs at Baidu:

Or better yet - on Cerebras

Kudos to Groq for writing that kernel.