Hacker News
by turnsout 1163 days ago
So this can tune a model 7X faster than LoRA, which was already a massive speed boost? Curious to see what this will do to the LLaMA-derivative community in particular.

7x faster compared to grid-search LoRA for best rank.

I am not convinced that the "best rank" is not just the highest possible with your compute budget, personally.

Highest possible in which combination, though? If you're fine-tuning a model with N layers, then you could apply LoRA to any or all of them. Maybe it's better to concentrate effort unevenly, in which case a uniform increase of adaptation rank (up to the compute budget) could still be subpar.
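
To make the budget point concrete, here is a rough sketch rather than anything from the paper. It relies only on the standard LoRA fact that an adapter of rank r on a d_out x d_in weight adds r * (d_in + d_out) trainable parameters. The layer shapes and the two rank allocations are made up for illustration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for one LoRA adapter on a d_out x d_in weight."""
    return rank * (d_in + d_out)


def total_params(layer_dims, ranks):
    """Sum adapter parameters over layers, one rank per layer."""
    if len(layer_dims) != len(ranks):
        raise ValueError("need exactly one rank per layer")
    return sum(
        lora_param_count(d_in, d_out, r)
        for (d_in, d_out), r in zip(layer_dims, ranks)
    )


# Hypothetical model: four adapted layers, each 512 x 512.
layer_dims = [(512, 512)] * 4

uniform = [4, 4, 4, 4]  # same rank everywhere
uneven = [1, 1, 6, 8]   # same total rank, concentrated in later layers

print(total_params(layer_dims, uniform))
print(total_params(layer_dims, uneven))
```

Both allocations cost the same number of trainable parameters, so the parameter budget alone does not tell you which one is better. That has to be measured.
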
Right, but the way this paper proposes to determine the best rank is by training a LoRA at the full rank.
What is the fastest way to show that?
Fastest way to show what? That you should train with the maximum-sized LoRA you can? Because the only upsides to having a smaller LoRA are in the training time, and if you are already able to train a DyLoRA with max rank 8, then you should just train a LoRA with that rank.
You get diminishing returns as you increase the rank, so with a fixed training budget it's not clear whether you get the best return from increasing rank vs. increasing something else. If you start off by training DyLoRA with max rank 8, you might see returns diminish fast beyond rank 5. Then you can use rank 5 for the rest of your training. You wouldn't know that with LoRA. I think this is the idea behind the paper. If you are just going to use your entire budget training a DyLoRA with max rank 8, then you're right, there's no advantage over LoRA with rank 8. You'd have to use the ability to assess multiple ranks in order to see some benefit.
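
A minimal sketch of that selection step, assuming you already have a validation loss for each rank from a single max-rank-8 run. The function name, the tolerance, and the loss values are all hypothetical and chosen only to mirror the "returns diminish beyond rank 5" example. They are not results from the paper.

```python
def smallest_sufficient_rank(loss_by_rank, tolerance=0.01):
    """Return the smallest rank whose loss is within `tolerance` of the best loss."""
    if not loss_by_rank:
        raise ValueError("loss_by_rank is empty")
    best = min(loss_by_rank.values())
    for rank in sorted(loss_by_rank):
        if loss_by_rank[rank] - best <= tolerance:
            return rank


# Made-up validation losses per rank, for illustration only.
losses = {1: 0.90, 2: 0.70, 3: 0.60, 4: 0.55, 5: 0.52, 6: 0.515, 7: 0.512, 8: 0.511}

print(smallest_sufficient_rank(losses))
```

With these invented numbers the function picks rank 5. How tight the tolerance should be is a judgment call that depends on the task.
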
I can see that. But are we sure that a rank-based difference that doesn't manifest early in the training process won't manifest as you get further along? See also 'grokking' [0]

[0]: https://arxiv.org/abs/2201.02177

Not sure there's any way to know beforehand whether that would happen but the advantage of DyLoRA is that at least you will know afterwards whether you really needed the full rank whereas with LoRA you wouldn't? In some cases that might not be valuable information but I guess you'd rather know than not.
Why is the only advantage at training time? I might be misunderstanding something, but with this method you can train once, and then deploy models that use an arbitrary rank (according to end users' compute requirements) and expect to have a model that performs best for that specific rank.
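
A rough sketch of what "deploy at an arbitrary rank" could look like, assuming the adapter was trained so that its leading rank components are usable on their own (the property DyLoRA aims for). The helper names are hypothetical, the matrices are toy values, and the usual LoRA scaling factor is left out for brevity.

```python
def truncate_adapter(up, down, rank):
    """Keep the first `rank` columns of `up` and the first `rank` rows of `down`."""
    max_rank = len(down)
    if not 1 <= rank <= max_rank:
        raise ValueError(f"rank must be between 1 and {max_rank}")
    return [row[:rank] for row in up], down[:rank]


def delta_weight(up, down):
    """Plain matrix product up @ down, giving the d_out x d_in weight update."""
    d_in = len(down[0])
    return [
        [sum(u_row[k] * down[k][j] for k in range(len(down))) for j in range(d_in)]
        for u_row in up
    ]


# Toy adapter trained at max rank 2: up is d_out x r, down is r x d_in.
up = [[1, 2], [3, 4]]
down = [[1, 0, 1], [0, 1, 1]]

full_update = delta_weight(up, down)                           # rank-2 update
small_update = delta_weight(*truncate_adapter(up, down, 1))    # rank-1 update

print(full_update)
print(small_update)
```

The point is only that a smaller-rank adapter is a slice of the larger one, so no retraining is needed to serve it. Whether the slice performs well is what the training method has to guarantee.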