| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by qeternity 385 days ago
	Attention is not the primary inference bottleneck. For each token you have to load all of the weights (or activated weights) from memory. This is why Cerebras is fast: they have huge memory bandwidth.