by wolttam
102 days ago
I think the main challenge with combining layers of different models would be their differing embedding sizes and potentially different vocabularies. Even two models of identical architecture may have landed on quite different internal representations if their training data recipes were substantially different. But it would be fun to experiment with.
Nobody here or on Reddit has mentioned this, maybe because it’s too obvious, but it’s clear to me that the residual connections are an absolutely necessary component of making this merging possible. They are the only reason dimension 1 of a later layer is encouraged to mean something similar to dimension 1 of an earlier layer.
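
For anyone who wants to poke at that intuition, here is a toy sketch in NumPy. It is not a real transformer: the "layers" are just random small-weight tanh maps, and every name and number in it (`run_stack`, `cosine`, the width of 64, the depth of 8, the weight scale) is made up for illustration. It only compares how closely the final hidden vector still lines up with the input, coordinate for coordinate, with and without the skip connection.

```python
import numpy as np

def run_stack(x, weights, residual):
    """Apply a stack of toy layers, with or without a skip connection."""
    h = x
    for W in weights:
        update = np.tanh(h @ W)
        # With a residual connection each layer only adds a correction to h,
        # so h keeps living in the same coordinate system as the input.
        h = h + update if residual else update
    return h

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, n_layers = 64, 8  # hypothetical width and depth, chosen arbitrarily

# Small random weights, so each layer's update is modest relative to h.
weights = [rng.normal(scale=0.1 / np.sqrt(d), size=(d, d)) for _ in range(n_layers)]
x = rng.normal(size=d)

cos_res = cosine(x, run_stack(x, weights, residual=True))
cos_plain = cosine(x, run_stack(x, weights, residual=False))

print(f"with residuals:    cos(input, output) = {cos_res:.3f}")
print(f"without residuals: cos(input, output) = {cos_plain:.3f}")
```

In this setup the residual stack should end up much closer to the input direction than the plain stack does, since the plain stack re-mixes every coordinate at every layer. That is only a cartoon of the argument, of course. Trained layers are not random, and this says nothing about whether two separately trained models share a basis.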