|
Like most things in this new world of Machine Learning, I'm really confused why this works? The analogy to noise-cancelling headphones is helpful but in that case we clearly know which is signal and which is noise. Here, if we knew why would we even bother to the noise-cancelling work? |
To make things worse, low attention values will have very low gradient, thus needing a lot of weight updates to undo that kind of mistakes. On the other hand, subtracting the output of two softmax allows the model to predict a weight of exactly zero for some of the values, while keeping a reasonable gradient flowing through.
So the model already knows what is noise, but a single softmax makes it harder to exclude it.
Moreover, with a single softmax the output of all heads is forced to stay in the convex hull of the value vectors, whereas with this variant each head can choose its own lambda, thus shifting the "range" of the outputs outside the convex hull pre-determined by the values. This makes the model as a whole more expressive.