by neodypsis
788 days ago
> Seems like we still have a long way to go after Adam...

A preprint on arXiv suggests that Adam works better than SGD for training LLMs because of class imbalance [0]. Scaling the gradient step appears to help training; see another approach suggested in [1].

0. https://arxiv.org/pdf/2402.19449
1. https://arxiv.org/pdf/2402.02347
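
To make the "scaling the gradient step" point concrete, here is a rough sketch (mine, not taken from either preprint) comparing a plain SGD step with the first step of textbook Adam. The two gradient values are made up for illustration, standing in for a parameter tied to a frequent class and one tied to a rare class.

```python
import math

def sgd_step(grads, lr=0.1):
    # Plain SGD: the step is proportional to the raw gradient.
    return [lr * g for g in grads]

def adam_first_step(grads, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Textbook Adam at t = 1, starting from zero moment estimates.
    t = 1
    steps = []
    for g in grads:
        m = (1 - beta1) * g            # first-moment estimate
        v = (1 - beta2) * g * g        # second-moment estimate
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps

# Hypothetical gradients: large for a frequent class, tiny for a rare one.
grads = [10.0, 0.01]

sgd = sgd_step(grads)
adam = adam_first_step(grads)

print("SGD steps: ", sgd)   # step sizes differ by the same factor as the gradients
print("Adam steps:", adam)  # step sizes are roughly equal across coordinates
```

With SGD the rare-class parameter moves about 1000x less per step than the frequent-class one, while Adam's per-coordinate normalization makes both steps close to `lr`. That is only the mechanical part of the story; whether it explains the gap on real LLM training is what [0] argues.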