


	by iandanforth 615 days ago
	The key bit I didn't understand at first was what happens if the two groups of attention learn the same thing. Because their attention masks are subtracted from one another, if they both output similar values the attention will drop to zero across the board, and this will lead to high loss. So the only way to reduce loss is for them to learn to attend to different things. One of the simplest strategies they could learn (and this paper claims that they do) is for one group to focus on relevant context and the other to focus on irrelevant context. Thus one group learns the noise and the other the signal (it's not this cut and dried, but it's a useful simplification for understanding, IMO).
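To make the subtraction concrete, here is a minimal sketch in plain Python. The scores, the `differential_weights` helper, and the `lam` scale factor are all made up for illustration; this is a toy over a single query, not the paper's implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def differential_weights(scores_a, scores_b, lam=1.0):
    # Hypothetical helper: softmax of group A minus lam times softmax of group B.
    a = softmax(scores_a)
    b = softmax(scores_b)
    return [x - lam * y for x, y in zip(a, b)]

# Made-up scores over 4 tokens; token 0 plays the "relevant" one.
# Case 1: both groups learn the same thing, so everything cancels.
same = differential_weights([2.0, 1.0, 1.0, 1.0], [2.0, 1.0, 1.0, 1.0])

# Case 2: group A favors token 0, group B favors the other tokens.
split = differential_weights([3.0, 1.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0])

print(same)
print(split)
```

In the first case every weight is zero, which is the collapse described above. In the second, token 0 keeps a positive weight while the tokens the second group favored go negative.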

4 comments

magicalhippo 615 days ago

An interesting aspect is that they don't do a plain subtraction, but rather subtract a portion of the second softmax.

This makes sense: if the two copies are identical, then the softmax outputs would be identical and a plain difference would be zero everywhere. However, by subtracting a scaled copy, the normalization of the difference seems to really boost the signal value(s) over the "noise", making the signal stand out compared to pre-normalization.


testdfkjahdfh 615 days ago

If two attentions A and B are identical, wouldn't (A - lambda * B) be just (1 - lambda) * A? How does it "boost the signal value(s) over the "noise""?
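A quick check of that algebra in plain Python. The scores and `lam = 0.8` are arbitrary, and dividing by the sum is an assumed choice of normalization, since the thread doesn't say which one was meant.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(weights):
    # Assumed normalization: divide by the sum so the weights add up to 1.
    total = sum(weights)
    return [w / total for w in weights]

lam = 0.8  # arbitrary scale factor for illustration
a = softmax([2.0, 1.0, 0.5, 0.0])  # made-up scores; A and B are identical

diff = [x - lam * x for x in a]       # A - lambda * B with B == A
scaled = [(1 - lam) * x for x in a]   # (1 - lambda) * A
renormalized = normalize(diff)

print(diff)
print(scaled)
print(renormalized)
print(a)
```

The difference matches (1 - lambda) * A up to floating-point error, and renormalizing by the sum just gives A back, so under this assumption nothing is boosted.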


magicalhippo 615 days ago

How embarrassing, I had one of those "autocorrect moments". I somehow put the lambda inside the softmax when thinking and trying it, without realizing. So what I was playing with in a spreadsheet (so not as obvious as plain code) was

    softmax(A) - softmax(lambda * A)

And as it so happens, normalizing the output of that with my test vectors seems to really boost the largest component if A and B are equal.
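For anyone who wants to try this outside a spreadsheet, here is a rough sketch. The test vector and `lam = 0.5` are made up, and the normalization is an assumption: both softmaxes sum to 1, so the raw difference sums to zero and can't simply be divided by its sum. This version clamps negatives to zero first, which may not be what was used in the spreadsheet.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def clamp_and_normalize(values):
    # Assumed normalization: zero out negatives, then divide by the sum.
    positive = [max(v, 0.0) for v in values]
    total = sum(positive)
    return [p / total for p in positive]

lam = 0.5                      # arbitrary scale factor for illustration
scores = [2.0, 1.0, 0.5, 0.0]  # made-up test vector (A and B equal)

plain = softmax(scores)
flatter = softmax([lam * s for s in scores])  # lambda inside the softmax
diff = [p - q for p, q in zip(plain, flatter)]
boosted = clamp_and_normalize(diff)

print(plain)
print(diff)
print(boosted)
```

With a lambda below 1 the second softmax is flatter, so for this test vector only the largest entry of the difference stays positive, and after clamping and normalizing its share is larger than in the plain softmax.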


patcon 615 days ago

> what happens if the two groups of attention learn the same thing

I wonder if there's a metaphor here for our own experience and utility in "surprise".

Like if one attention head is surprised by what another learns, up-weight it. But if they both find the same, assume it's not very surprising and down-weight it.

Admittedly, "surprise" is something that has a big section of my knowledgebase[1][2][3] (both as a subjective feeling and as an adaptive function of our minds, one of the most complex adaptive systems we know of).

[1] https://plus.maths.org/content/information-surprise

[2] https://blakeelias.name/papers/Multi-Agent-Cooperation-Intri...

[3] https://complexity.simplecast.com/episodes/81/transcript


dartos 615 days ago

There’s probably a small chance that they could both learn the same thing, but it’s probably not likely enough to be a major issue.


nextaccountic 615 days ago

Maybe the loss function could penalize them learning the same thing?
