by stu2b50
1178 days ago
> But this is smaller than the alternative of a model which contains the large matrix of original weights, and an equally large matrix of alterations.

It's actually larger. If you just have two equally large matrices of the same dimension, one original and one of "alterations", then you can just add them together.

> Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

Then you'd have to compute the gradients for the whole network, which is very expensive when the model has 7b, 65b, 165b parameters. The intent is to make that cheaper by only computing gradients for a low-rank representation of the change in the weight matrix from training.
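To make the size argument concrete, here's a rough sketch in plain Python. The function names, the 4096 x 4096 layer shape, and the rank of 8 are made up for illustration and don't come from any particular library or model. It only counts trainable parameters and shows the "just add them together" merge; it says nothing about the cost of the backward pass itself.

```python
def lora_param_counts(d, k, r):
    """Trainable-parameter counts for one d x k weight matrix.

    full:     updating every entry of W directly
    low_rank: updating B (d x r) and A (r x k), where the change is B @ A
    """
    full = d * k
    low_rank = r * (d + k)
    return full, low_rank


def matmul(a, b):
    """Matrix product of two nested lists (no external dependencies)."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
        for row in a
    ]


def merge(w, b, a):
    """Fold the low-rank change back into the weights: W' = W + B @ A."""
    delta = matmul(b, a)
    return [
        [w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical layer shape and rank, chosen only for illustration.
full, low_rank = lora_param_counts(d=4096, k=4096, r=8)
print(full, low_rank, full // low_rank)

# Tiny rank-1 example: once merged, the result has the same shape as W.
w = [[1, 0], [0, 1]]
b = [[1], [2]]   # 2 x 1
a = [[3, 4]]     # 1 x 2
print(merge(w, b, a))
```

Once merged, the result is a single matrix of the original shape, which is why keeping a full-size matrix of alterations around separately buys you nothing.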
|
You have to do that with LoRA regardless, to compute the gradients for the lowest-level LoRA weights.