by twic 1177 days ago
Ah, so the resulting model contains both the large matrix of original weights, and also the two small matrices of alterations? But this is smaller than the alternative of a model which contains the large matrix of original weights, and an equally large matrix of alterations.

Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

3 comments

> Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

The goal of most parameter-efficient methods is to store one gold copy of the original model, and learn minor modifications/additions to the model. The easiest way to think about this is in some kind of deployment setting, where you have 1 capable model and you learn different sets of LoRA weights for different tasks and applications.

The original intent of parameter-efficient methods is to reduce the storage space needed for models (do you really want to keep a whole additional copy of LLaMA for each task?). A secondary benefit is that, because you are fine-tuning far fewer parameters, the optimizer state (which can take up to 2x the size of your model) shrinks as well, which makes parameter-efficient fine-tuning more economical memory-wise.
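To put rough numbers on the storage and optimizer-state argument, here is a back-of-the-envelope sketch for a single weight matrix. The sizes (a 4096 x 4096 layer, rank 8, 10 tasks) and the helper names are made up for illustration, and the "two extra values per trainable parameter" figure assumes an Adam-style optimizer.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values for a low-rank update B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical sizes, chosen only for illustration.
D_OUT, D_IN, RANK, NUM_TASKS = 4096, 4096, 8, 10

full = full_finetune_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)

# Storage for NUM_TASKS tasks: one full copy per task
# versus one shared base matrix plus one small adapter per task.
storage_full_copies = NUM_TASKS * full
storage_shared_base = full + NUM_TASKS * lora

# Assumption: an Adam-style optimizer that keeps two extra values
# (first and second moment) per trainable parameter.
ADAM_STATE_PER_PARAM = 2
opt_state_full = ADAM_STATE_PER_PARAM * full
opt_state_lora = ADAM_STATE_PER_PARAM * lora

print(f"trainable values: full={full:,}  lora={lora:,}  ({full // lora}x fewer)")
print(f"storage for {NUM_TASKS} tasks: copies={storage_full_copies:,}  shared base={storage_shared_base:,}")
print(f"optimizer state: full={opt_state_full:,}  lora={opt_state_lora:,}")
```

The ratio depends entirely on the rank and layer shape you pick, so treat the output as an illustration of the scaling, not as a measurement of any real model.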

That’s probably what OpenAI does with their custom fine-tuned models, no?

> But this is smaller than the alternative of a model which contains the large matrix of original weights, and an equally large matrix of alterations.

It's actually larger. If you just have two equally large matrices of the same dimension, one original and one of "alterations", then you can just add them together into a single matrix.
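A tiny sketch of the "just add them together" point, using plain Python lists and integer values so the comparison is exact. The matrices and helper functions are invented for illustration. The convention assumed here is `y = W x + B (A x)`, with `B @ A` as the low-rank alteration.

```python
def matmul(p, q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def matadd(p, q):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(row_p, row_q)] for row_p, row_q in zip(p, q)]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Hypothetical 3x3 base weights and a rank-1 alteration B @ A.
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1], [0], [2]]   # 3 x 1
A = [[1, -1, 2]]      # 1 x 3
x = [1, 2, 3]

# Keeping the alteration separate: base path plus low-rank path.
separate = [base + delta for base, delta in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Folding the alteration into the weights: one matrix of the original size.
W_merged = matadd(W, matmul(B, A))
merged = matvec(W_merged, x)

print(separate, merged)
```

Both paths give the same output, and `W_merged` has exactly the same shape as `W`, which is why a full-size matrix of alterations never needs to be stored next to the original.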

> Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

Then you'd have to compute the gradients for the whole network, which is very expensive when the model has 7B, 65B, or 165B parameters. The intent is to make that cheaper by only computing gradients for a low-rank representation of the change in the weight matrix from training.

> Then you'd have to compute the gradients for the whole network

You have to do that with LoRA regardless, to compute the gradients for the lowest-level LoRA weights.

Correct me if I'm wrong, but I think you still need to compute gradients of non-trained weights in order to compute the gradients of the LoRA weights. What you don't have to do is store and update the optimizer state for all those non-trained weights.
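
To make that distinction concrete, here is a hand-written backward pass for a two-layer scalar chain, where each layer has a frozen weight `w` plus a trainable LoRA-style product `b * a`. Everything here (the function name, the values, the squared-error loss) is an assumption for illustration, not how any particular library does it.

```python
def lora_chain_grads(x, target, w1, a1, b1, w2, a2, b2):
    """Gradients of 0.5 * (y - target)**2 w.r.t. the LoRA scalars only.

    Forward: h = (w1 + b1 * a1) * x, then y = (w2 + b2 * a2) * h.
    w1 and w2 are frozen, so no gradient is computed or stored for them.
    """
    h = (w1 + b1 * a1) * x
    y = (w2 + b2 * a2) * h

    dy = y - target
    # Backpropagating into h still multiplies by the frozen w2:
    # frozen weights are needed to pass gradients to earlier layers.
    dh = dy * (w2 + b2 * a2)

    return {
        "a2": dy * b2 * h,
        "b2": dy * a2 * h,
        "a1": dh * b1 * x,
        "b1": dh * a1 * x,
        # Deliberately skipped: d/dw2 = dy * h and d/dw1 = dh * x.
    }


# Hypothetical values, picked so the arithmetic is exact in floating point.
grads = lora_chain_grads(x=2.0, target=1.0, w1=3.0, a1=1.0, b1=1.0,
                         w2=0.5, a2=2.0, b2=0.25)
print(grads)

# Changing only the frozen w2 changes the gradient of the first layer's LoRA scalars.
grads_other_w2 = lora_chain_grads(x=2.0, target=1.0, w1=3.0, a1=1.0, b1=1.0,
                                  w2=1.5, a2=2.0, b2=0.25)
print(grads_other_w2["a1"])
```

In this sketch the gradient still flows through the frozen weights on its way back, but the per-weight gradients for `w1` and `w2` (and any optimizer state for them) are never produced.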

I mean, the derivative of a constant is 0. So if all of the original weights are considered constants, then computing their gradients is trivial, since they’re just zero.

Computing gradients is easy/cheap. What this technique solves is that you no longer need to store the computed values of the gradient until the backpropagation phase, which saves on expensive GPU RAM, allowing you to use commodity hardware.

It's larger, but there are fewer parameters to train for your specific use case, since you are training only the small matrices while the original weights remain unaltered.