by nyrikki
613 days ago
While I don't discount the value of this, can you expand on what you mean by the claim that it makes the model 'more expressive'? Everything I am seeing in this paper is related to reduced size and noise, which implies a reduction in expressiveness. The improvements on needle-in-a-haystack, on benchmarks of multi-hop questions over in-corpus data, and on many-shot in-context learning all point to this. This is a wonderful thing if robustness is more important than generality, but it doesn't address the cost of trimming away activations that may be spurious in the general use case but may improve specificity in an individual domain. Context would dramatically impact which tradeoffs are more desirable, and noise is probably never desirable. But the ability of this paper to enable smaller bit sizes for inference points to a reduction in expressiveness. Perhaps I am too focused on generalization?
|
Maybe expressiveness is not the right term, or not the main consequence. I could imagine that having different subspaces like that also introduces a degree of robustness to out-of-distribution inputs, as this would make it harder for the outputs of one attention head to shift towards the in-distribution outputs of another head, and thus for the following layer to confuse them.