Hacker News
by x1000 617 days ago
Could you help explain how we would achieve an attention score of exactly 0, in practice? Here’s my take:

If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, with a probability of effectively 0 for any single entry to exactly equal 0.

What’s more, the learnable parameter \lambda allows for negative values. This would allow the model to learn to actually add the attention scores, making a score of exactly 0 impossible.
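To make the two points above concrete, here is a minimal sketch in plain Python. It assumes a single attention row computed as softmax(a1) - lambda * softmax(a2); the function names, the row length of 8, and the random logits are illustrative choices, not anything taken from the paper's code.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def diff_attention(a1, a2, lam):
    """One row of weights: softmax(a1) - lam * softmax(a2) (hypothetical helper)."""
    s1, s2 = softmax(a1), softmax(a2)
    return [p - lam * q for p, q in zip(s1, s2)]

# Illustrative random logits for one query over 8 keys.
rng = random.Random(0)
a1 = [rng.gauss(0.0, 1.0) for _ in range(8)]
a2 = [rng.gauss(0.0, 1.0) for _ in range(8)]

row_pos = diff_attention(a1, a2, lam=1.0)   # entries fall strictly between -1 and 1
row_neg = diff_attention(a1, a2, lam=-0.5)  # negative lambda adds the two maps

print(row_pos)
print(row_neg)
```

With lambda = 1 the entries are a mix of small positive and negative values that sum to roughly 0, and with a negative lambda every entry is strictly positive, so an exact 0 cannot occur in that case.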

1 comment

Your comment brings up two interesting variants that could be useful if your goal is to increase the sparsity of the attention:

- Rectify the difference of the softmaxes. (max(0, s(A1) - lambda s(A2)))

- Apply the Heaviside function to the second softmax. (s(A1) - lambda H(s(A1) - lambda s(A2)))

The second one is a bit more drastic and maybe harder to train, since the Heaviside step has zero gradient almost everywhere.
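A rough sketch of how these would produce exact zeros, in plain Python. It assumes the rectified variant means clipping negative differences to 0, i.e. max(0, s(A1) - lambda s(A2)). For the Heaviside variant it only computes the step H(s(A1) - lambda s(A2)) itself, since how that mask should be combined with the first softmax is open to interpretation. All names and the random logits are illustrative.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rectified_diff(a1, a2, lam):
    """Elementwise max(0, softmax(a1) - lam * softmax(a2)) (assumed reading)."""
    s1, s2 = softmax(a1), softmax(a2)
    return [max(0.0, p - lam * q) for p, q in zip(s1, s2)]

def heaviside_of_diff(a1, a2, lam):
    """Step function of the difference: 1.0 where it is positive, else 0.0."""
    s1, s2 = softmax(a1), softmax(a2)
    return [1.0 if p - lam * q > 0.0 else 0.0 for p, q in zip(s1, s2)]

# Illustrative random logits for one query over 8 keys.
rng = random.Random(0)
a1 = [rng.gauss(0.0, 1.0) for _ in range(8)]
a2 = [rng.gauss(0.0, 1.0) for _ in range(8)]

rect = rectified_diff(a1, a2, lam=1.0)
mask = heaviside_of_diff(a1, a2, lam=1.0)
num_exact_zeros = sum(1 for w in rect if w == 0.0)

print(rect)
print(mask)
print(num_exact_zeros)
```

Every key where the second softmax outweighs the first ends up with a weight of exactly 0.0 after rectification. The mask takes only the values 0 and 1, so it is piecewise constant in the logits and would need something like a surrogate gradient to be trained.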