|
|
|
|
|
by Imnimo
613 days ago
|
|
I feel like I'm missing a key insight here. I understand the problem that regular softmax attention struggles to approach assigning zero attention to irrelevant stuff. And I get that having this subtraction formula makes it possible to assign exactly (or near) zero attention weight without having crazy outlier activations. But it seems like it also makes it very easy to have negative attention weight (which is equivalent to having positive attention weight on the negation of your value vectors). Intuitively, it just feels like a difficult balancing act to keep all the stuff you don't care about so close to zero. But Figure 1 clearly shows that it works, so I don't doubt that it is in fact possible. I'm just struggling to build a picture of how exactly the network accomplishes this. |
|
softmax should be exp()/1+∑exp()
Notice the 1 added to the denominator.
The difference is at the negative limit, softmax can be 0, instead of some epsilon. The same could be done by adding an extra zero value in x.
Downside is, you have to retrain your model from scratch to fix this.