by bluecoconut
699 days ago
Nice~ Glad to see this published / confirmed by others. Next I hope to see some of this symmetry used to improve MoE / dynamic compute / adaptive style models!

Context: I found the same structure (early, middle, and end layers serving different purposes, including the permutability of the middle layers) a year or so ago, but never got around to testing more models rigorously or publishing it. We talked about it a bit in a Hacker News thread a few months ago. (https://news.ycombinator.com/item?id=39504780#39505523)

> One interesting finding though (now that I'm rambling and just typing a lot) is that in a static model, you can "shuffle" the layers (e.g. swap layer 4's weights with layer 7's weights) and the resulting tokens roughly seem similar (likely caused by the ResNet style backbone). Only the first ~3 layers and last ~3 layers seem "important to not permute". It kinda makes me interpret models as using the first few layers to get into some "universal" embedding space, operating in that space "without ordering in layer-order", and then "projecting back" to token space at the end. (rather than staying in token space the whole way through).
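
For anyone who wants to poke at the layer-swap idea without loading a real model, here is a rough toy sketch in plain Python. It is not the experiment described above and not a transformer: the "layers" are small random matrices applied on a residual stream, and every name here (`is_middle`, `swap_layers`, `forward`, `demo`, `PROTECTED`) is made up for illustration. The only thing taken from the comment is the "leave the first ~3 and last ~3 layers alone" rule and the layer 4 / layer 7 swap.

```python
import math
import random

PROTECTED = 3  # assumption: the "~3" first/last layers mentioned in the comment


def is_middle(n_layers, idx, protected=PROTECTED):
    """True if idx lies outside the first and last `protected` layers."""
    return protected <= idx < n_layers - protected


def swap_layers(layers, i, j, protected=PROTECTED):
    """Return a copy of `layers` with i and j exchanged; refuse edge layers."""
    n = len(layers)
    if not (is_middle(n, i, protected) and is_middle(n, j, protected)):
        raise ValueError("only middle layers may be permuted")
    out = list(layers)
    out[i], out[j] = out[j], out[i]
    return out


def forward(layers, x):
    """Toy residual stream: x <- x + tanh(W @ x) for each weight matrix W."""
    for w in layers:
        update = [math.tanh(sum(a * b for a, b in zip(row, x))) for row in w]
        x = [xi + ui for xi, ui in zip(x, update)]
    return x


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def demo(n_layers=12, dim=16, scale=0.01, seed=0):
    """Compare the output before and after swapping layers 4 and 7."""
    rng = random.Random(seed)
    # Hypothetical weights: small Gaussian matrices, so each update is a
    # small nudge on the residual stream.
    layers = [
        [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
        for _ in range(n_layers)
    ]
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    original = forward(layers, x)
    shuffled = forward(swap_layers(layers, 4, 7), x)
    return cosine(original, shuffled)
```

Calling `demo()` returns the cosine similarity between the two outputs. When each layer only adds a small update to the residual stream, the order of the updates should matter little, so the value should sit close to 1. That only illustrates the ResNet-backbone intuition; it says nothing about how much a real LLM tolerates the swap.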
|
Maybe it’s crazy, but is there any possibility that, say, Llama and Mistral use the same representation space?