
Even with the same embedding sizes and vocabularies, nothing forces dimension 1 of model 1 to mean the same thing as dimension 1 of model 2. There are lots of ways to permute the dimensions of a model without changing its output, so whatever dimension 1 means the first time you train a model is just as likely to end up as dimension 2 the second time you train it as it is to stay consistent with the first model.
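To make the permutation point concrete, here is a minimal sketch in plain Python. It assumes a toy two-layer ReLU network with random weights; the sizes and the names `mlp`, `matvec`, and `perm` are made up for illustration and don't come from any particular model. Reordering the hidden units, and reordering the next layer's input columns to match, leaves the output unchanged even though every hidden dimension now holds a different value.

```python
import random

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, b1, W2, x):
    # Toy two-layer network: linear -> ReLU -> linear.
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W1, x), b1)]
    return hidden, matvec(W2, hidden)

rng = random.Random(0)
d_in, d_hidden, d_out = 3, 4, 2  # arbitrary toy sizes
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

# New hidden unit i is old hidden unit perm[i].
perm = [2, 0, 3, 1]
W1_p = [W1[j] for j in perm]                   # reorder rows of the first layer
b1_p = [b1[j] for j in perm]                   # reorder its biases the same way
W2_p = [[row[j] for j in perm] for row in W2]  # reorder columns of the second layer

hidden, out = mlp(W1, b1, W2, x)
hidden_p, out_p = mlp(W1_p, b1_p, W2_p, x)

# Same function up to float rounding, but the hidden dimensions are shuffled.
outputs_match = all(abs(a - b) < 1e-9 for a, b in zip(out, out_p))
print("outputs match:", outputs_match)
print("hidden:         ", hidden)
print("hidden permuted:", hidden_p)
```

Both parameter sets compute the same function, so training has no reason to prefer one ordering over the other.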

Nobody here or on Reddit has mentioned this, maybe because it’s too obvious, but it’s clear to me that the residual connections are an absolutely necessary component for making this merging possible: they’re the only reason dimension 1 of a later layer is encouraged to mean something similar to dimension 1 of an earlier layer.
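A toy sketch of why the skip path matters, assuming a single linear branch with hand-picked numbers (the helper names `plain_layer` and `residual_layer` are hypothetical, and this says nothing about what a trained model actually learns). Without a skip connection, swapping a layer's output dimensions only relabels its output, which the next layer could absorb. With a skip connection, the same swap changes the values, because the branch output is added to an input whose dimensions were not swapped.

```python
def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def permute(v, perm):
    # Entry i of the result is entry perm[i] of v.
    return [v[j] for j in perm]

def plain_layer(W, x):
    # No skip connection: y = W x
    return matvec(W, x)

def residual_layer(W, x):
    # Skip connection: y = x + W x
    return [xi + fi for xi, fi in zip(x, matvec(W, x))]

x = [1.0, 2.0]
W = [[0.5, 0.0],
     [0.0, -1.0]]
perm = [1, 0]
W_swapped = permute(W, perm)  # swap the layer's two output dimensions

plain = plain_layer(W, x)                   # [0.5, -2.0]
plain_swapped = plain_layer(W_swapped, x)   # [-2.0, 0.5]
res = residual_layer(W, x)                  # [1.5, 0.0]
res_swapped = residual_layer(W_swapped, x)  # [-1.0, 2.5]

# Plain layer: same values in a different order, i.e. a pure relabeling.
plain_is_relabeling = sorted(plain_swapped) == sorted(plain)
# Residual layer: different values, so the swap is no longer free.
residual_is_relabeling = sorted(res_swapped) == sorted(res)

print(plain_is_relabeling, residual_is_relabeling)  # True False
```

In this sketch the only way to keep the residual output a relabeling is to apply the same permutation to `x` as well, so the ordering has to stay consistent along the whole residual stream.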