Hacker News new | ask | show | jobs
by iandanforth 615 days ago
The key bit I didn't understand at first was what happens if the two groups of attention learn the same thing; because their attention masks are subtracted from one another if they both output similar values the attention across the board will drop to zero and this will lead to high loss. So the only way to reduce loss is if they learn to attend to different things. One of the simplest strategies they could learn (and this paper claims that they do) is for one group to focus on relevant context and the other to focus on irrelevant context. Thus one group learns the noise and the other the signal (it's not this cut and dry but is a useful simplification for understanding IMO).
4 comments

An interesting aspect is that they don't do a plain subtraction, but rather subtract a portion of the second softmax.

This makes sense, if one considers that the two copies are identical then the softmax outputs would be identical and the difference is zero everywhere. However, by subtracting a scaled copy, the normalization of the difference seems to really boost the signal value(s) over the "noise", making the signal stand out compared to pre-normalization.

if two attentions A, B are identical, would (A - lambda * B) be just (1-lambda) * A, how does it "boost the signal value(s) over the "noise""?
How embarrassing, I had one of those "autocorrect moments". I somehow put the lambda inside the softmax when thinking and trying it without realizing. So what I was playing with in a spreadsheet (so not so obvious as plain code) was

    softmax(A) - softmax(lambda * A)
And as so happens, normalizing the output of that that with my test vectors seems to really boost the output the largest component if A and B are equal.
> what happens if the two groups of attention learn the same thing

I wonder if there's a metaphor here for our own experience and utility in "surprise".

Like if one attention head is surprised by what another learns, up-weight it. But if they both find the same, assume it's not very surprising and down-weight it.

Admittedly, "surprise" is something that has a big section of my knowledgebase[1][2][3] (both as a subjective feeling and adaptive function of our minds, one of the most complex adaptive system we know of)

[1] https://plus.maths.org/content/information-surprise

[2] https://blakeelias.name/papers/Multi-Agent-Cooperation-Intri...

[3] https://complexity.simplecast.com/episodes/81/transcript

There’s probably a small chance that they could both learn the same thing, but it’s probably not likely enough to be a major issue.
Maybe the loss function could penalize them learning the same thing?