Hacker News
by whimsicalism 1163 days ago
I'm unsure of the value of dynamically reducing the rank of the LoRA matrix at inference time, given that most of the parameter count probably comes from the original weights rather than the LoRA diff.
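
To put rough numbers on that, here is a minimal sketch assuming a single square weight matrix of size 4096 x 4096 (an illustrative size I picked, not one from the paper) and the usual LoRA factorization into a `d_out x rank` and a `rank x d_in` matrix. The helper name `lora_param_fraction` is made up for this example.

```python
def lora_param_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Extra parameters a LoRA of the given rank adds, relative to one d_out x d_in weight."""
    base_params = d_out * d_in            # frozen original weight matrix
    lora_params = rank * (d_out + d_in)   # one d_out x rank factor plus one rank x d_in factor
    return lora_params / base_params


# Illustrative layer size (an assumption, not taken from the paper).
for rank in (1, 8, 64):
    print(rank, f"{lora_param_fraction(4096, 4096, rank):.4%}")
```

Even at rank 64 the diff is a small fraction of that one layer, so shaving the rank at inference time saves little relative to the base weights.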

But nonetheless, training time improvements look interesting.

e: Oh I see, the training time improvement is compared to a grid search over the LoRA rank. Not for a single run.

I am not convinced that you shouldn't just train at the highest rank your compute budget allows. If you can train a DyLoRA with rank 8, why not just train a LoRA with that rank?

2 comments

Yea, this is interesting but I can't see the immediate value (not that there isn't).

Maybe if the "optimal rank" of LoRA applies to any adaptation and you are interested in training multiple adaptations for different use cases?

The optimal rank could differ across layers.
I would be shocked if the "optimal rank" in terms of performance weren't simply the maximum DyLoRA rank across all layers.
Err, I suppose trivially, the higher-rank terms include the lower-rank subnets, so they dominate in terms of quality.
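
A toy illustration of that containment, assuming the lower-rank subnet is obtained by keeping only the first `b` components of the two low-rank factors. The matrices and helper names here are made up, and this is plain Python rather than anything from the paper's code. Truncating to rank `b` gives the same update as the full-rank factors with the trailing components zeroed, so anything a rank-`b` update can express, a higher-rank one can too.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def truncate(up, down, b):
    """Keep only the first b components: up is d_out x r, down is r x d_in."""
    return [row[:b] for row in up], down[:b]


def zero_tail(up, down, b):
    """Same shapes as the full-rank factors, but components b..r-1 of `up` are zeroed."""
    up_zeroed = [row[:b] + [0] * (len(row) - b) for row in up]
    return up_zeroed, down


# Made-up rank-3 factors for a 2 x 2 weight update.
up = [[1, 2, 3], [4, 5, 6]]
down = [[1, 0], [0, 1], [2, 2]]

rank2_update = matmul(*truncate(up, down, 2))
full_with_zeros = matmul(*zero_tail(up, down, 2))
print(rank2_update == full_with_zeros)
```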

But if you have some capacity constraint (e.g., memory, I guess?) then you can imagine dynamic rank allocation helping in the case where the maximum rank across all layers isn't within budget.

It's a bit of a stretch though, I agree.

As someone else mentioned [0], the procedure would basically be to train a DyLoRA for a few initial iterations, search across the layers for the best-scoring combination of ranks, and then prune to just those ranks and train to completion.
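
A rough sketch of just the search step, under a parameter budget. Everything here is hypothetical: `best_rank_combo`, the layer shapes, and especially `toy_score`, which stands in for evaluating the partially trained DyLoRA on a validation set with each layer truncated to the given rank. The warm-up and final training phases are left out.

```python
from itertools import product


def lora_params(d_out, d_in, rank):
    """Parameter count of one LoRA diff: a d_out x rank and a rank x d_in factor."""
    return rank * (d_out + d_in)


def best_rank_combo(layer_shapes, candidate_ranks, score_fn, param_budget):
    """Exhaustively try every per-layer rank assignment that fits the budget."""
    best_combo, best_score = None, float("-inf")
    for combo in product(candidate_ranks, repeat=len(layer_shapes)):
        cost = sum(lora_params(d_out, d_in, r)
                   for (d_out, d_in), r in zip(layer_shapes, combo))
        if cost > param_budget:
            continue  # over budget, skip this assignment
        score = score_fn(combo)
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score


def toy_score(combo):
    """Stand-in for a validation metric: layer 1 benefits far more from rank than layer 0."""
    return 0.1 * combo[0] + 1.0 * combo[1]


shapes = [(8, 8), (8, 8)]  # two made-up layers
combo, score = best_rank_combo(shapes, (1, 2, 4), toy_score, param_budget=80)
print(combo, score)
```

With this toy score the budget rules out rank 4 everywhere, and the search spends the rank on the layer that benefits most. The exhaustive loop grows exponentially with the number of layers, so a real model would need a greedy or per-layer heuristic instead.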

Seems complicated but I could see it being useful potentially.

[0]: https://news.ycombinator.com/item?id=35517353