Hacker News new | ask | show | jobs
by magicalhippo 614 days ago
An interesting aspect is that they don't do a plain subtraction, but rather subtract a portion of the second softmax.

This makes sense, if one considers that the two copies are identical then the softmax outputs would be identical and the difference is zero everywhere. However, by subtracting a scaled copy, the normalization of the difference seems to really boost the signal value(s) over the "noise", making the signal stand out compared to pre-normalization.

1 comments

if two attentions A, B are identical, would (A - lambda * B) be just (1-lambda) * A, how does it "boost the signal value(s) over the "noise""?
How embarrassing, I had one of those "autocorrect moments". I somehow put the lambda inside the softmax when thinking and trying it without realizing. So what I was playing with in a spreadsheet (so not so obvious as plain code) was

    softmax(A) - softmax(lambda * A)
And as so happens, normalizing the output of that that with my test vectors seems to really boost the output the largest component if A and B are equal.