|
|
|
|
|
by xcodevn
606 days ago
|
|
During inference, it is not a matrix x matrix operation, but rather a weight matrix x input vector operation, as we are generating one token at a time. The bottleneck now is how fast we can load the weight matrix from memory to tensor cores, hence the need for weight quantization. |
|