|
|
|
|
|
by ActorNightly
1134 days ago
|
|
Its basically almost the same as convolution with image processing. For example, you take the 3 channel rgb value of a single pixel, do some math on it with the values of the surrounding pixels with weights, which gives you some value(s). Depending on the dimensions of everything, you can end up with a smaller dimension output, like a single 3 channel RGB value, or a higher dimension output (i.e for a 5x5 kernel, you can end up with a 9x9 output) The confusing part that doesn't get mentioned is that the input vectors (Q, K, V) are weighted, i.e they are derived from the input with the standard linear transformation where y = A*x+b, where x is the input word, A is the linear layer matrix, and b is the bias. Those weighs are the things that are learned through the training process. |
|