by cabidaher 756 days ago
This paper [1] attempts that and reports performance similar to conventional pre-training. However, the authors start with a phase of normal full-rank training and claim that it is needed to 'warm start' the training process.

[1] https://arxiv.org/abs/2307.05695

1 comment

Oh yes, this paper! The main issue is the scaling of the A and B LoRA matrices. Some papers show that training the B matrix with a larger learning rate than A (LoRA+) could be beneficial. DoRA, for example, learns a scaling (magnitude) vector that tries to alleviate these issues.
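
To make the LoRA+ idea concrete, here is a rough sketch of how the two learning rates could be set up as optimizer parameter groups. The `lora_A` / `lora_B` name substrings, the helper name `lora_plus_param_groups`, and the ratio of 16 are all assumptions for illustration, not something taken from the paper's code.

```python
def lora_plus_param_groups(param_names, base_lr=1e-4, b_lr_ratio=16.0):
    """Split parameter names into two groups with different learning rates.

    Hypothetical helper: parameters whose name contains "lora_B" get
    base_lr * b_lr_ratio, everything else gets base_lr.
    """
    group_a = {"lr": base_lr, "params": []}
    group_b = {"lr": base_lr * b_lr_ratio, "params": []}
    for name in param_names:
        if "lora_B" in name:
            group_b["params"].append(name)
        else:
            group_a["params"].append(name)
    return [group_a, group_b]


# Assumed naming scheme, for illustration only.
names = ["layer0.lora_A", "layer0.lora_B", "layer1.lora_A", "layer1.lora_B"]
groups = lora_plus_param_groups(names)
for g in groups:
    print(g["lr"], g["params"])
```

In a real setup the lists would hold the parameter tensors rather than their names, and the groups would be handed to whatever optimizer you use.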

GaLore might be closer to full pretraining, since it keeps the weights full-rank and makes the gradients low-rank instead.
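
A toy sketch of what "full weights, low-rank gradients" can look like. This is heavily simplified and makes several assumptions: rank 1 only, power iteration instead of an SVD, and plain SGD instead of Adam. The function names are made up for the example.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def top_left_singular_vector(G, iters=50):
    """Approximate the top left singular vector of G by power iteration on G G^T."""
    Gt = [list(col) for col in zip(*G)]
    u = [1.0] * len(G)
    for _ in range(iters):
        w = matvec(G, matvec(Gt, u))
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # zero gradient, nothing to project
            break
        u = [x / norm for x in w]
    return u


def projected_sgd_step(W, G, lr=0.1):
    """Update the full weight matrix W using only a rank-1 projection of G."""
    u = top_left_singular_vector(G)
    Gt = [list(col) for col in zip(*G)]
    low = matvec(Gt, u)  # u^T G: the compressed gradient, length n instead of m*n
    # Project back to full size (outer product of u and low) and apply the step.
    return [[w - lr * ui * lj for w, lj in zip(row, low)]
            for row, ui in zip(W, u)]


W = [[0.0, 0.0], [0.0, 0.0]]
G = [[2.0, 4.0], [1.0, 2.0]]  # this gradient happens to be exactly rank 1
W_new = projected_sgd_step(W, G)
print(W_new)
```

Because the example gradient is already rank 1, the projection loses nothing here and the step matches ordinary SGD. For a general gradient, only the component along the top singular direction would be applied.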