by WithinReason
789 days ago
The two methods seem to be independent; I wonder if you can combine them for even better performance. Interestingly, both seem to modify the optimisation process indirectly, which in my opinion is effectively an attempt to fix a bad optimiser. It seems we still have a long way to go after Adam...
A preprint on arXiv suggests that Adam works better than SGD for training LLMs because of class imbalance [0]. It appears that scaling the gradient step helps with training; for another approach along these lines, see [1].
[0] https://arxiv.org/pdf/2402.19449
[1] https://arxiv.org/pdf/2402.02347
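
To make the "scaling the gradient step" point concrete, here is a rough sketch of a single SGD update next to a single textbook Adam update. This is not code from either preprint. The two-element gradient is a made-up stand-in for a frequent class (large gradient) and a rare class (tiny gradient), and the names `sgd_step` and `adam_step` are just illustrative.

```python
import numpy as np

def sgd_step(param, grad, lr=0.1):
    # Plain SGD: the step is proportional to the raw gradient.
    return param - lr * grad

def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam update with bias correction.
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each coordinate is rescaled by its own gradient magnitude.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Hypothetical gradient: coordinate 0 is a "frequent" class, coordinate 1 a "rare" one.
grad = np.array([1.0, 1e-3])
params = np.zeros(2)

sgd_delta = sgd_step(params, grad) - params
adam_params, m, v = adam_step(params, grad, np.zeros(2), np.zeros(2), t=1)
adam_delta = adam_params - params

print("SGD step: ", sgd_delta)
print("Adam step:", adam_delta)
```

Under SGD the rare coordinate moves about a thousand times less than the frequent one, while the first Adam step moves both by roughly the learning rate. That is only the per-coordinate rescaling in isolation; whether it is the whole explanation for LLM training is what [0] argues about.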