|
|
|
|
|
by daemonologist
190 days ago
|
|
I agree - the results on the finetunes are not very surprising. The trained-from-scratch ResNets (Figure 2 and Section 3.2.1) are definitely more interesting, though somewhat limited in scope. In any case, my impression is that this is not immediately more useful than a LoRA (and is probably not intended to be), but is maybe an avenue for further research. |
|
The ResNet results hold from scratch because strict local constraints (e.g., 3x3 convolutions) force the emergence of fundamental signal-processing features (Gabor/Laplacian filters) regardless of the dataset. The architecture itself enforces the subspace.
The Transformer/ViT results rely on fine-tunes because of permutation symmetry. If you trained two ViTs from scratch, "Attention Head 4" in Model A might be functionally identical to "Head 7" in Model B, but mathematically orthogonal.
Because the authors' method (SVD) lacks a neuron-alignment step, scratch-trained ViTs would not look aligned. They had to use pre-trained models to ensure the weights shared a coordinate system. Effectively, I think that they proved that CNNs converge due to it's arch, but for Transformers, they mostly just confirmed that fine-tuning doesn't drift far from the parent model.