|
|
|
|
|
by ActorNightly
176 days ago
|
|
People need to get away from this idea of Key/Query/Value as being special. Whereas a standard deep layer in a network is matrix * input, where each row of the matrix is the weights of the particular neuron in the next layer, a transformer is basically input* MatrixA, input*MatrixB, input*MatrixC (where vector*matrix is a matrix), then the output is C*MatrixA*MatrixB*MatrixC. Just simply more dimensions in a layer. And consequently, you can represent the entire transformer architecture with a set of deep layers as you unroll the matricies, with a lot of zeros for the multiplication pieces that are not needed. This is a fairly complex blog but it shows that its just all matrix multiplication all the way down. https://pytorch.org/blog/inside-the-matrix/. |
|