by thomasahle
1059 days ago
If you take the inner product between a lot of more or less random vectors (the key and query vectors in attention), most values are going to be close to 0. This means each one contributes about e^0 = 1 to the denominator. Now, if you have a context length of, say, 2000, your denominator is already ~2000. Increasing it to 2001 doesn't really make a difference. Adding 1 to the denominator can be useful if you have a softmax with just a few options, but not in self-attention, where you have thousands.
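To put rough numbers on this, here is a minimal NumPy sketch. The function names (`softmax`, `softmax_plus_one`, `max_relative_gap`) and the choice of small Gaussian scores are my own assumptions for illustration, not anything from a particular library or from the post under discussion.

```python
import numpy as np

def softmax(x):
    # Standard softmax, shifted by the max for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_plus_one(x):
    # Hypothetical "+1" variant: exp(x_i) / (1 + sum_j exp(x_j)).
    # Shifting by m >= 0 turns the extra 1 into exp(-m).
    m = max(x.max(), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum())

def max_relative_gap(x):
    # Largest relative difference between the two outputs.
    p, q = softmax(x), softmax_plus_one(x)
    return np.max(np.abs(p - q) / p)

rng = np.random.default_rng(0)
# Assumption: scores near 0, as for roughly random key/query vectors.
long_context = rng.normal(0.0, 0.1, size=2000)
few_options = np.zeros(3)

print("2000 scores:", max_relative_gap(long_context))
print("3 scores:   ", max_relative_gap(few_options))
```

With three zero scores the denominator goes from 3 to 4, so every output shrinks by a quarter. With 2000 near-zero scores the relative change is about 1/2001.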
|
Except with the +1 denominator, the model might train all of the inputs to become very negative so that softmax outputs values close to zero, whereas it wouldn't bother before, because the outputs have to sum to 1, so making one prob bigger makes another smaller.
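A small sketch of that scenario, again with hypothetical function names and an arbitrary choice of -20 for "very negative": the standard softmax still sums to 1 however negative the scores are, while the +1 variant lets the total weight collapse toward zero.

```python
import numpy as np

def softmax(x):
    # Standard softmax; outputs always sum to 1.
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_plus_one(x):
    # Hypothetical "+1" variant: exp(x_i) / (1 + sum_j exp(x_j)).
    m = max(x.max(), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum())

# Assumption: the model has pushed every score to a large negative value.
scores = np.full(2000, -20.0)

total_standard = softmax(scores).sum()
total_plus_one = softmax_plus_one(scores).sum()

print("standard softmax total weight:", total_standard)
print("+1 softmax total weight:      ", total_plus_one)
```

Whether a trained model actually learns to use this escape hatch is an empirical question that this sketch does not answer. It only shows that the option exists with the +1 term and does not exist without it.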