

	by magicalhippo 661 days ago
	An interesting aspect is that they don't do a plain subtraction, but rather subtract a scaled portion of the second softmax. This makes sense: if the two copies were identical, the softmax outputs would be identical too, and a plain difference would be zero everywhere. By subtracting a scaled copy instead, normalizing the difference seems to really boost the signal value(s) over the "noise", making the signal stand out more than it did before normalization.


testdfkjahdfh 661 days ago

If two attentions A and B are identical, wouldn't (A - lambda * B) be just (1 - lambda) * A? How does it "boost the signal value(s) over the "noise""?
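A quick numeric check of this point, as a minimal sketch: the scores in `a` and the value `lam = 0.5` are made-up illustration values, and `softmax` is a plain textbook implementation, not code from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

a = [2.0, 1.0, 0.5, 0.1]  # hypothetical attention scores
lam = 0.5                 # hypothetical scaling factor

s = softmax(a)
diff = [p - lam * p for p in s]         # softmax(A) - lam * softmax(B) with B == A
scaled = [(1 - lam) * p for p in s]     # (1 - lam) * softmax(A)
renorm = [d / sum(diff) for d in diff]  # rescale so the entries sum to 1 again
```

Here `diff` matches `scaled` and `renorm` matches `s` up to floating-point rounding, so with the scaling outside the softmax and identical inputs, nothing gets boosted.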


magicalhippo 660 days ago

How embarrassing, I had one of those "autocorrect moments". I somehow put the lambda inside the softmax while thinking about it and trying it out, without realizing. So what I was playing with in a spreadsheet (so not as obvious as in plain code) was:

    softmax(A) - softmax(lambda * A)

And as it so happens, normalizing the output of that with my test vectors seems to really boost the largest component of the output if A and B are equal.
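The spreadsheet experiment can be sketched roughly as below. The comment doesn't say which normalization was used. Since both softmax outputs sum to 1, their difference sums to zero, so dividing by the sum isn't an option; this sketch assumes L2 normalization instead. The vector `a` and `lam = 0.5` are made-up test values.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

a = [2.0, 1.0, 0.5, 0.1]  # hypothetical test vector
lam = 0.5                 # hypothetical lambda, placed inside the softmax

p = softmax(a)
q = softmax([lam * x for x in a])  # flatter distribution, since lam < 1
diff = [pi - qi for pi, qi in zip(p, q)]

# Assumed normalization: divide by the L2 norm of the difference.
norm = math.sqrt(sum(d * d for d in diff))
normalized = [d / norm for d in diff]

print("softmax(a):", [round(v, 3) for v in p])
print("normalized:", [round(v, 3) for v in normalized])
```

With these values only the largest component of `a` ends up positive in `diff`, and after the assumed L2 normalization it is larger than its plain softmax value. Other normalizations or test vectors may behave differently.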
