Hacker News new | ask | show | jobs
by nyrikki 613 days ago
While I don't discount the value of this, can you expand on the meaning of your claim that it makes the model 'more expressive'

Everything I am seeing in this paper is related to reduced size and noise, which implies a reduction in expressiveness.

The improvement in needle and a haystack, benchmarks on multi-hop questions of in corpus data and multishot in-context learning points to this.

This is a wonderful thing if robustness is more important than generality, but it doesn't address trimming away activations that may be spurious in the general use case but may improve an individual domain specificity.

Context would dramatically impact what tradeoffs and more desireble, and noise is probably never desirable. But the ability of this paper to enable bit size for inference points to a reduction in expressiveness.

Perhaps I am too focused on generalization?

1 comments

What I meant is that by changing lambda each attention head is able to put its outputs in a subspace that is different than that of the other heads. This means that the outputs of different heads do not mingle with each other, and it's easier for the following layer to pick them apart. So I was thinking at increased expressiveness because the attention output can in principle cover a larger volume.

Maybe expressiveness is not the right term, or not the main consequence. I could imagine that having different subspaces like that also introduces a degree of robustness to out-of-distribution inputs, as this would make it harder for the outputs of one attention head to shift towards the in-distribution outputs of another head, and thus for the following layer to confuse them.