Hacker News new | ask | show | jobs
by edwardjhu 1180 days ago
Hi! I'm the author of the repo.

The insight is that we don't need to modify a lot of parameters to get a generally competent model to do well on specific tasks. When you have a linear layer with a weight matrix of dimension d_in x d_out, the change you undergo during full finetuning is also a matrix of d_in x d_out, which can be huge. We represent the latter using two matrices of shape d_in x r and r x d_out. You save a lot of parameters when r is small. So when you use it, the input goes through two streams 1) the orignal frozen weight turning a vector of size d_in to d_out and 2) the low-rank weights turning a vector of size d_in to r and r to d_out. The two streams are then summed together. (There's a figure in the paper.)

This way of doing thing is nice for a few reasons. It's easy to parallelize. You can change r to control how many parameters to train. You can also merge the low-rank weights with the original one to avoid latency.

Note that we don't select a subset of the original parameters. We train extra ones.

3 comments

What is the difference to training an adapter? Or to adding a new task specific layer [0]? Has it been demonstrated that LoRA works best out of all of these approaches?

[0] https://towardsdatascience.com/adding-custom-layers-on-top-o...

Adapters are extra layers inserted between existing layers, so they can't be parallelized. LoRA reparametrizes the weight updates and is easily parallelized or merged with the original weights during inference. Also, if you let the rank r be the hidden size you roughly recover finetuning, so you can see LoRA as a generalization of the latter.

Add a task specific layer and only training that layer doesn't work well. In practice, people combine many of these things, e.g., LoRA + task-specific final layer.

Thanks for the clarification. Does that mean then that when parallelization is not important, training an adapter might be just as good as or better than LoRA?
If latency is irrelevant, I don't think there is a strong practical reason to prefer one over another. (LoRA is more elegant in my biased opinion because you roughly recover finetuning with a large r.) In practice, you see one do a little better on some tasks and vice versa on others as observed by papers after mine.
Hi! I in _no way_ mean to detract or malign or "anything negative" the parent comment (communication is hard!!), BUT I really compliment that exact sentence. :)

My background contains signal processing, "pre-deep learning ML", systems engineering, and firmware, and that sentence jumped out at me as crystal clear in my mind, despite not knowing what HuggingFace is or PyTorch.

Correct me if I'm wrong: These huge models involve lots of weights used in large matrices. The contribution of this work is to plug in some matrix factorization and learn a lower dimensional representation, instead of a large second matrix.

Fantastic!

Also makes me wonder what other performance improvements await through proper application of established and well known Mathematics. :D

Great, we can get authoritative answers. (I'm trying to understand the ML space and have mostly done readings, not an expert.)

I am assuming you can have n LoRA fine-tunings, say each specializing in one aspect of a coherent task, with n summers, running in parallel, and then combine them at the end? Or more generally, does LoRA enable a sort of modularizing around a core (un-merged) model?

And curious if you ever tried merging 2 or more fine-tunings and then testing the resultant single model (merge all) against the original tests to check retention?

This paper tries something like that

https://arxiv.org/pdf/2202.13914.pdf

The gain isn't that significant. We don't understand what these low-rank updates represent, and they might not correspond to "skills" that humans have.