by testdfkjahdfh 613 days ago
If two attention maps A and B are identical, wouldn't (A - lambda * B) be just (1 - lambda) * A? How does that "boost the signal value(s)" over the "noise"?
1 comments

How embarrassing, I had one of those "autocorrect moments". I somehow put the lambda inside the softmax while thinking it through and trying it out, without realizing. So what I was playing with in a spreadsheet (so not as obvious as plain code) was

    softmax(A) - softmax(lambda * A)
And as it happens, normalizing the output of that with my test vectors seems to really boost the largest component of the output when A and B are equal.
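
For anyone who wants to compare the two variants outside a spreadsheet, here is a rough sketch in plain Python. The scores, lam = 0.5, and the helper names are assumptions made up for illustration, not the original spreadsheet values. It only shows the shape of the two results and does not try to guess which normalization was used.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical test vector and lambda (assumptions, not from the thread).
scores = [2.0, 1.0, 0.5]
lam = 0.5

plain = softmax(scores)

# Lambda outside the softmax with A == B: softmax(A) - lam * softmax(A).
# This is just (1 - lam) * softmax(A), a uniform rescaling.
diff_outside = [p - lam * p for p in plain]
rescaled = [d / sum(diff_outside) for d in diff_outside]

# Lambda inside the softmax: softmax(A) - softmax(lam * A).
flatter = softmax([lam * s for s in scores])
diff_inside = [p - q for p, q in zip(plain, flatter)]

print("softmax(A):              ", plain)
print("outside, rescaled:       ", rescaled)
print("inside, raw difference:  ", diff_inside)
print("inside, sum of entries:  ", sum(diff_inside))
```

With lambda outside, rescaling gives back softmax(A) unchanged, which is the point of the original question. With lambda inside and these inputs, the difference is positive for the largest score and negative for the other two, and it sums to zero (both softmaxes sum to 1), so whatever normalization is applied afterwards has to deal with that.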