| HN Mirror



	by WithinReason 789 days ago
	The two methods seem to be independent; I wonder if you could combine them for even better performance. Interestingly, both seem to modify the optimisation process indirectly, in my opinion effectively trying to fix a bad optimiser. Seems like we still have a long way to go after Adam...

1 comment

neodypsis 789 days ago

> Seems like we still have a long way to go after Adam...

A preprint on arXiv suggests that Adam works better than SGD for training LLMs because of class imbalance [0]. Scaling the gradient step appears to help training; for example, see another approach suggested in [1].

0. https://arxiv.org/pdf/2402.19449
1. https://arxiv.org/pdf/2402.02347
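
To make "scaling the gradient step" concrete, here is a rough sketch of one plain SGD step next to one Adam step in pure Python. The two-parameter setup and the gradient values (one large, one tiny, standing in for a frequent and a rare class) are made up for illustration and are not taken from either preprint; `sgd_step` and `adam_step` are hypothetical helper names, and the Adam update follows the standard textbook form with bias correction.

```python
import math

def sgd_step(params, grads, lr):
    # Plain SGD: every coordinate moves in proportion to its raw gradient.
    return [p - lr * g for p, g in zip(params, grads)]

def adam_step(params, grads, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam: each coordinate is rescaled by a running estimate
    # of its own gradient magnitude.
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy values (assumed): a large gradient next to a very small one.
params = [0.0, 0.0]
grads = [1.0, 0.001]
lr = 0.1

sgd_params = sgd_step(params, grads, lr)
adam_params, m, v = adam_step(params, grads, [0.0, 0.0], [0.0, 0.0], t=1, lr=lr)

print("SGD :", sgd_params)
print("Adam:", adam_params)
```

On the first step, SGD barely moves the small-gradient parameter, while Adam moves both parameters by roughly `lr`, since its update is close to `lr * sign(g)` when the moment estimates are freshly initialised. Whether that is the mechanism behind the result in [0] is a question for the preprint itself; this only shows the per-coordinate rescaling.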
