	by xcodevn 606 days ago
	During inference, the core operation is not matrix × matrix but weight matrix × input vector, because we generate one token at a time. Each weight is read from memory and used for only one multiply-add, so the bottleneck becomes how fast we can load the weight matrix from memory into the tensor cores, hence the need for weight quantization.
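A rough back-of-the-envelope sketch of that argument, counting FLOPs per byte moved (arithmetic intensity) for a single weight matrix. Everything here is an assumption for illustration: the helper name `arithmetic_intensity`, the 4096 × 4096 layer size, the 512-column batch standing in for a matrix × matrix case, and 2-byte activations. It ignores caches, kernel overheads, and the extra scale/zero-point data that real quantization formats carry.

```python
def arithmetic_intensity(rows, cols, batch, bytes_per_weight, bytes_per_activation=2):
    """FLOPs per byte moved for a (rows x cols) weight matrix times a (cols x batch) input.

    Hypothetical helper: a simple counting model, not a measurement.
    """
    # One multiply and one add per weight, per input column.
    flops = 2 * rows * cols * batch
    # Every weight is read once from memory.
    weight_bytes = rows * cols * bytes_per_weight
    # Input is read once, output is written once.
    activation_bytes = (cols * batch + rows * batch) * bytes_per_activation
    return flops / (weight_bytes + activation_bytes)


d = 4096  # assumed layer width

# One token at a time: weight matrix x input vector.
decode_fp16 = arithmetic_intensity(d, d, batch=1, bytes_per_weight=2)
# Same operation with weights stored in 4 bits (0.5 bytes each).
decode_int4 = arithmetic_intensity(d, d, batch=1, bytes_per_weight=0.5)
# Many columns at once: weight matrix x input matrix.
batched_fp16 = arithmetic_intensity(d, d, batch=512, bytes_per_weight=2)

print(f"matrix x vector, fp16 weights: {decode_fp16:.2f} FLOPs/byte")
print(f"matrix x vector, int4 weights: {decode_int4:.2f} FLOPs/byte")
print(f"matrix x matrix, fp16 weights: {batched_fp16:.2f} FLOPs/byte")
```

Under these assumptions the matrix × vector case does about 1 FLOP per byte with fp16 weights and about 4 with 4-bit weights, while the 512-column matrix × matrix case reaches about 410. Shrinking the weights is the lever that moves the single-token number, since the weight bytes dominate the traffic.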