> what happens if the two groups of attention learn the same thing
I wonder if there's a metaphor here for our own experience and utility in "surprise".
Like if one attention head is surprised by what another learns, up-weight it. But if they both find the same, assume it's not very surprising and down-weight it.
Admittedly, "surprise" is something that has a big section of my knowledgebase[1][2][3] (both as a subjective feeling and adaptive function of our minds, one of the most complex adaptive system we know of)