Hacker News new | ask | show | jobs
by gliptic 1181 days ago
Correct me if I'm wrong, but I think you still need to compute gradients of non-trained weights in order to compute the gradients of the LoRA weights. What you don't have to do is store and update the optimizer state for all those non-trained weights.
1 comments

I mean the derivative of a constant is 0. So if all of the original weights are considered constants, then computing their gradients is trivial, since they’re just zero.
Computing gradients is easy/cheap. What this technique solves is that you no longer need to store the computed values of the gradient until the backpropagation phase, which saves on expensive GPU RAM, allowing you to use commodity hardware.