by evnc
746 days ago
I'm a bit of a noob here, but if a) a linear SSM (a form of RNN?) is equivalent to Attention without the scaling and softmax; and b) Attention is "all you need" and the thing that made Transformers radically outperform all the previous architectures like LSTMs that used to dominate NLP; does that imply c) the scaling and softmax parts of the attention equation, in particular, are the magic touch that makes Transformers work so well?
The softmax function plays an important role: it normalizes the attention scores, which lets the model weigh different parts of the input sequence dynamically. So, unlike RNNs, which process inputs sequentially and update a state as they go, Transformers can directly access and prioritize information from any part of the sequence, and they are not slower for T < N.
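
To make the comparison in this thread concrete, here is a rough sketch in plain Python (not taken from any paper or library). It computes causal attention with and without the scaling and softmax, and shows that the version without them can be rewritten as an RNN-style recurrence over a running state. The function names (`causal_attention`, `linear_recurrence`) and the toy inputs are made up for illustration, and the causal (left-to-right) setup is an assumption.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(Q, K, V, use_softmax=True):
    # Each position t looks back at positions 0..t (hypothetical toy setup).
    d = len(Q[0])
    out = []
    for t, q in enumerate(Q):
        scores = [dot(q, K[s]) for s in range(t + 1)]
        if use_softmax:
            # Standard form: scale by sqrt(d), then normalize with softmax.
            weights = softmax([sc / math.sqrt(d) for sc in scores])
        else:
            # "Linear" form: use the raw dot products as weights.
            weights = scores
        out.append([sum(w * V[s][j] for s, w in enumerate(weights))
                    for j in range(len(V[0]))])
    return out

def linear_recurrence(Q, K, V):
    # Same result as use_softmax=False, computed like an RNN:
    # the state S accumulates the outer products k_t v_t^T, and o_t = q_t^T S.
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out

if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
    K = [[0.2, 1.0], [1.0, -1.0], [0.3, 0.7]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(causal_attention(Q, K, V, use_softmax=False))
    print(linear_recurrence(Q, K, V))
    print(causal_attention(Q, K, V, use_softmax=True))
```

The first two printed results should agree up to floating-point rounding, because without softmax the sum over past positions factors into a fixed-size state. With softmax the weights depend on all past scores jointly, so that factoring is no longer available and the model has to keep the past keys and values around.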