by Grosvenor
613 days ago
Regular softmax (and attention) has an error in it. Softmax should be exp(x_i) / (1 + ∑_j exp(x_j)). Notice the 1 added to the denominator. The difference shows up at the negative limit: when every input is very negative, the outputs can all go to 0, whereas regular softmax always has to sum to 1. The same can be done by appending an extra zero value to x and dropping its output. Downside is, you have to retrain your model from scratch to fix this.
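A rough sketch of the formula above in plain Python, in case it helps to see the numbers. The function names (softmax, softmax_plus_one, softmax_with_extra_zero) are made up for illustration and are not from any library; the max-shift is only there for numerical stability.

```python
import math

def softmax(xs):
    # Standard softmax: exp(x_i) / sum_j exp(x_j), shifted by max for stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(xs):
    # Proposed variant: exp(x_i) / (1 + sum_j exp(x_j)).
    # Shift by m = max(max(xs), 0); the "1" in the denominator becomes exp(-m).
    m = max(max(xs), 0.0)
    exps = [math.exp(x - m) for x in xs]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]

def softmax_with_extra_zero(xs):
    # Same thing via the "extra zero" trick: append a 0 logit, then drop its output.
    return softmax(list(xs) + [0.0])[:-1]

if __name__ == "__main__":
    xs = [-10.0, -12.0, -11.0]  # every input strongly negative
    print(sum(softmax(xs)))           # forced to sum to 1
    print(sum(softmax_plus_one(xs)))  # free to be close to 0
```

With all-negative inputs the regular version still hands out a full unit of weight, while the variant lets the total shrink toward zero. The two variant functions should agree up to floating-point error.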
|
If you think about it, the "escape hatch" is the design of the entire transformer dictionary. If Key/Query attention misaligns with Value's weights, you get a layer head that does not attend to anything...