by cabidaher
756 days ago
This paper [1] does attempt that and reports performance comparable to conventional pre-training. However, they start off with a phase of normal full-rank training and claim it is needed to 'warm start' the training process.

[1] https://arxiv.org/abs/2307.05695

GaLore might be closer to full pre-training: the weights stay full rank, and only the gradients are low rank.
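
A rough NumPy sketch of what "full-rank weights, low-rank gradients" could look like. This is not GaLore's actual implementation. The function name `low_rank_gradient_step`, the plain-SGD update, and recomputing the SVD on every step are all simplifying assumptions made for illustration.

```python
import numpy as np

def low_rank_gradient_step(W, G, rank, lr=0.1):
    """One plain-SGD step that uses a rank-limited projection of the gradient G.

    Hypothetical helper for illustration only: W keeps its full shape,
    while the applied update has rank at most `rank`.
    """
    # Top left-singular vectors of the gradient form the projection basis.
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :rank]          # shape (m, rank)
    R = P.T @ G              # compact gradient, shape (rank, n)
    update = P @ R           # projected back to (m, n), rank <= `rank`
    return W - lr * update, update

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))   # full-rank weight matrix
G = rng.standard_normal((8, 6))   # stand-in for a real gradient
W_new, update = low_rank_gradient_step(W, G, rank=2)
print(np.linalg.matrix_rank(update), W_new.shape)
```

The point is that only the small matrix `R` (and any optimizer state kept for it) is low rank. The weights themselves are never factorized.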