by edwardjhu
1180 days ago
Hi! I'm the author of the repo. The insight is that we don't need to modify a lot of parameters to get a generally competent model to do well on specific tasks. When you have a linear layer with a weight matrix of dimension d_in x d_out, the change it undergoes during full finetuning is also a matrix of dimension d_in x d_out, which can be huge. We represent that change using two matrices of shape d_in x r and r x d_out. You save a lot of parameters when r is small. So when you use it, the input goes through two streams: 1) the original frozen weight, turning a vector of size d_in into one of size d_out, and 2) the low-rank weights, turning a vector of size d_in into one of size r and then into one of size d_out. The two streams are then summed together. (There's a figure in the paper.) This way of doing things is nice for a few reasons. It's easy to parallelize. You can change r to control how many parameters to train. You can also merge the low-rank weights with the original ones to avoid extra latency. Note that we don't select a subset of the original parameters. We train extra ones.
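To make the two streams and the merge step concrete, here is a minimal sketch in plain Python. It is not the code from the repo: the names (W, A, B, matmul, and so on) and the sizes are assumptions picked for illustration, the weights are random rather than trained, and it leaves out the initialization and scaling details from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols, rng):
    """Random matrix, standing in for real (pre)trained weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes, chosen only for illustration.
d_in, d_out, r = 64, 48, 4
rng = random.Random(0)

W = rand_matrix(d_in, d_out, rng)  # frozen original weight, d_in x d_out
A = rand_matrix(d_in, r, rng)      # extra trainable weight, d_in x r
B = rand_matrix(r, d_out, rng)     # extra trainable weight, r x d_out
x = rand_matrix(1, d_in, rng)      # one input vector of size d_in

# Stream 1: d_in -> d_out through the frozen weight.
# Stream 2: d_in -> r -> d_out through the low-rank weights.
# The two streams are summed.
two_stream = add(matmul(x, W), matmul(matmul(x, A), B))

# Merging: fold the low-rank product into the original weight once,
# so a forward pass is a single matrix multiply again.
W_merged = add(W, matmul(A, B))
merged = matmul(x, W_merged)

full_params = d_in * d_out          # size of a full d_in x d_out update
lora_params = d_in * r + r * d_out  # size of the two low-rank matrices
max_diff = max(abs(p - q) for p, q in zip(two_stream[0], merged[0]))

print(full_params, lora_params)
print(max_diff)
```

With these sizes the low-rank pair has 448 entries against 3072 for a full update, and the two-stream and merged outputs agree up to floating-point rounding.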