
That is not 100% what I read in this paper. There are several takeaways:

1. LoRA makes transformers linear, whereas pre-training keeps the non-linearity (in 3.1 and 3.2), which is kinda to be expected. [One more insight is that a combination of seemingly linear blocks can lead to non-linear output.] Thus you can replace part of the layers in fine-tuned models with nn.Linear for inference ...

2. There is a way to make LoRA keep the non-linearity by changing the loss function, which also improves the performance of the model (in 4, Cosine Similarity regularization term).

3. Small models are surprisingly unaffected. But IMO that may be because of the small number of layers and weights overall, and the adapter layer being much larger relative to the model size.
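To make points 1 and 2 concrete, here is a rough PyTorch sketch, not the paper's code. `fit_linear_replacement` and `cosine_regularizer` are hypothetical names of mine. The first fits an nn.Linear to a block's input/output pairs by least squares, assuming 2-D activations of shape (n_samples, d). The second is one plausible form of a cosine similarity term, assuming it penalizes `1 - cos` between hidden states of consecutive layers; the exact form in section 4 may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fit_linear_replacement(block: nn.Module, inputs: torch.Tensor) -> nn.Linear:
    """Fit an nn.Linear that mimics `block` on `inputs` via least squares.

    Assumes `inputs` has shape (n_samples, d_in) and that `block` returns
    a tensor of shape (n_samples, d_out). Hypothetical helper for illustration.
    """
    with torch.no_grad():
        targets = block(inputs)                          # (n, d_out)
        ones = inputs.new_ones(inputs.shape[0], 1)       # column for the bias term
        design = torch.cat([inputs, ones], dim=1)        # (n, d_in + 1)
        # Solve min ||design @ sol - targets||; sol has shape (d_in + 1, d_out).
        sol = torch.linalg.lstsq(design, targets).solution
        linear = nn.Linear(inputs.shape[1], targets.shape[1])
        linear = linear.to(device=inputs.device, dtype=inputs.dtype)
        linear.weight.copy_(sol[:-1].T)                  # nn.Linear stores (d_out, d_in)
        linear.bias.copy_(sol[-1])
    return linear


def cosine_regularizer(hidden_states, weight: float = 1.0) -> torch.Tensor:
    """Sum of mean (1 - cosine similarity) over consecutive layer outputs.

    `hidden_states` is a list of tensors with matching shapes, one per layer.
    This is an assumed form of the regularizer, not taken from the paper.
    """
    penalty = hidden_states[0].new_zeros(())
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        cos = F.cosine_similarity(prev, curr, dim=-1)
        penalty = penalty + (1.0 - cos).mean()
    return weight * penalty
```

In practice you would collect `inputs` with a forward hook on real data, check the fit error on held-out activations before swapping a block out, and add `cosine_regularizer(...)` to the usual task loss during fine-tuning.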