Hacker News
by eaglefield 750 days ago
I think by "linear" they mean transformations with a linearity score close to 1; they define that score a little higher up. So composing many almost-linear transformations can still produce a total transformation that is very nonlinear.
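A toy sketch of that composition point, in case it helps. To be clear about the assumptions: this is not the paper's metric. The "linearity score" here is just the R^2 of a best-fit line on a 1-D grid (my stand-in), and the "layer" is a hypothetical tanh(1.3x) that I picked because it looks almost linear on [-1, 1].

```python
import math

def linearity_score(xs, ys):
    """R^2 of the least-squares line y = a*x + b.

    Toy 1-D proxy (an assumption, not the paper's definition):
    1.0 means the points lie exactly on a line.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    a = sxy / sxx  # slope of the best-fit line
    ss_res = sum((y - my - a * (x - mx)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / syy

def layer(x):
    # Hypothetical "almost linear" layer on [-1, 1].
    return math.tanh(1.3 * x)

def compose(f, depth):
    # Apply the same layer `depth` times in a row.
    def stacked(x):
        for _ in range(depth):
            x = f(x)
        return x
    return stacked

xs = [i / 100 for i in range(-100, 101)]
one_layer = linearity_score(xs, [layer(x) for x in xs])
fifty_layers = linearity_score(xs, [compose(layer, 50)(x) for x in xs])
print(f"1 layer: {one_layer:.3f}, 50 layers: {fifty_layers:.3f}")
```

A single layer should score close to 1, while the 50-layer stack collapses toward a step function and scores noticeably lower. A real transformer is obviously not a stack of identical scalar maps, so take this only as intuition.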
2 comments

A fun example of that (making a neural network using floating point error as a source of non-linearity): https://youtu.be/Ae9EKCyI1xU?si=n9vgvCvxoxrQeKd8
That is not 100% what I read in this paper. I see several takeaways:

1. LoRA makes transformers linear, whereas pre-training keeps the non-linearity (in 3.1 and 3.2), which is kinda to be expected. [One more insight is that a combination of seemingly linear blocks can lead to non-linear output.] Thus you can replace part of the layers in fine-tuned models with nn.Linear for inference ...

2. There is a way to make LoRA keep the non-linearity by changing the loss function, which also improves the performance of the model (in 4, Cosine Similarity regularization term).

3. Small models are surprisingly unaffected. But IMO that may be because of the small number of layers and weights overall, with the adapter layer being much larger relative to the model size.
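For the nn.Linear replacement in point 1, here is a rough sketch of how one might try it, under my own assumptions rather than the paper's procedure: record a block's inputs and outputs, solve an ordinary least-squares problem, and load the result into an nn.Linear. `ToyBlock` and `fit_linear_replacement` are hypothetical names I made up for the illustration, and the toy block is deliberately near-linear.

```python
import torch
from torch import nn

torch.manual_seed(0)

class ToyBlock(nn.Module):
    """Hypothetical near-linear block: a linear map plus a small tanh term."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        h = self.proj(x)
        return h + 0.01 * torch.tanh(h)

def fit_linear_replacement(block, inputs):
    """Fit an nn.Linear approximating `block` on `inputs` by least squares."""
    with torch.no_grad():
        targets = block(inputs)                                # (n, d_out)
        ones = torch.ones(inputs.shape[0], 1)
        design = torch.cat([inputs, ones], dim=1)              # (n, d_in + 1)
        solution = torch.linalg.lstsq(design, targets).solution
        linear = nn.Linear(inputs.shape[1], targets.shape[1])
        linear.weight.copy_(solution[:-1].T)                   # (d_out, d_in)
        linear.bias.copy_(solution[-1])                        # last row is the bias
    return linear

dim = 16
block = ToyBlock(dim)
replacement = fit_linear_replacement(block, torch.randn(512, dim))

# Check the approximation on fresh inputs the fit never saw.
x_test = torch.randn(256, dim)
with torch.no_grad():
    rel_err = ((replacement(x_test) - block(x_test)).norm()
               / block(x_test).norm()).item()
print(f"relative error of linear replacement: {rel_err:.4f}")
```

On a real fine-tuned model you would presumably collect the activations with forward hooks on actual data and check downstream quality, not just the reconstruction error of one block.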