


	by cabidaher 803 days ago
	This paper [1] does attempt that and reports performance similar to conventional pre-training. However, they start with a normal full-rank training phase and claim it is needed to 'warm start' the training process. [1] https://arxiv.org/abs/2307.05695

1 comment

danielhanchen 803 days ago

Oh yes this paper! The main issue is the scaling of the A and B LoRA matrices. Some papers show that training the B matrix with a larger learning rate than A (LoRA+) could be beneficial. DoRA, for example, learns a separate magnitude (scaling) vector, which tries to alleviate these issues.
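To make the scaling point concrete, here is a rough pure-Python sketch of a LoRA update where B gets a larger learning rate than A, which is the LoRA+ idea. Everything in it is an assumption for illustration: the helper names (`lora_delta`, `lora_plus_step`), plain SGD, the usual `alpha / rank` scaling, and the ratio of 16 are made up for the example and are not taken from any particular library or from the papers' code.

```python
# Illustrative sketch only: tiny LoRA update with a LoRA+-style learning-rate split.
# Helper names and hyperparameter values are hypothetical.

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def lora_delta(A, B, alpha, rank):
    """Low-rank weight update (alpha / rank) * B @ A, added to the frozen weight W."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]

def lora_plus_step(A, B, grad_W, alpha, rank, lr_A, lr_ratio):
    """One plain SGD step on A and B, given dL/dW for the effective weight.

    B is updated with learning rate lr_A * lr_ratio (assumed LoRA+-style split).
    """
    scale = alpha / rank
    # Chain rule through delta = scale * B @ A:
    #   dL/dB = scale * grad_W @ A^T,   dL/dA = scale * B^T @ grad_W
    grad_B = [[scale * v for v in row] for row in matmul(grad_W, transpose(A))]
    grad_A = [[scale * v for v in row] for row in matmul(transpose(B), grad_W)]
    lr_B = lr_A * lr_ratio
    new_A = [[a - lr_A * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr_B * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B

# Rank-1 adapter for a 2x3 weight: A is (rank x d_in), B is (d_out x rank).
A = [[1.0, 2.0, 3.0]]          # A starts non-zero
B = [[0.0], [0.0]]             # B starts at zero, so the initial delta is zero
grad_W = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]     # made-up gradient w.r.t. the effective weight

new_A, new_B = lora_plus_step(A, B, grad_W, alpha=2.0, rank=1, lr_A=0.01, lr_ratio=16.0)
```

With B initialised to zero, the first step moves only B (the gradient for A is zero), so the learning rate on B controls how quickly the adapter departs from the base weights.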

GaLore might be closer to full pretraining, with the gradients being low rank.
