Hacker News
by itsthecourier 1069 days ago
So it happens that when training an LLM, some training batches worsen the model, but the same batches actually improve it when fed later. Why?
3 comments

It makes intuitive sense.

If I speak Japanese at you (and you are a non-Japanese speaker), I will just confuse you. If instead you spend two years learning Japanese and I then share some information with you in Japanese, you will learn something new and become more knowledgeable.

That is a good analogy. It can be taken a step further: in the human context the confusion is temporary and results in the rejection of the data. In the LLM there is no rejection; the update is forced into the weight matrices in the wrong context, so it is harmful.
"In this work, we argue that the training loss instabilities observed in large-scale training should be associated with the time-domain correlation between the gradient estimates of earlier layers in the deep-learning models. Based on the identified connection, we propose several ways to mitigate the instabilities, along with the heuristic method that was known in the literature. We conclude that at this point, there is no silver bullet to solve the problem, and the appropriate remedy depends on the specific setup of the large-scale training run."
So, it's a form of superstition?
This may be naive, but the gradient seen during one training batch would depend not only on the content of that batch, but also on the outcome of all previous batches (or so I suppose). If that is so, then whether one of these spikes occurs is a function not only of the batch, but also of the sequence of prior batches.
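To make that concrete, here is a toy sketch, not anything resembling real LLM training. It assumes a one-parameter model y_hat = w * x with mean squared error and plain SGD. The data, the learning rate, and the names `grad`, `sgd`, `probe`, `history_a` and `history_b` are all made up for illustration. The same probe batch is evaluated after two different histories of earlier batches:

```python
def grad(w, batch):
    """Gradient of mean squared error w.r.t. w for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd(w, batches, lr=0.1):
    """Apply one plain SGD step per batch, in order, and return the final weight."""
    for batch in batches:
        w -= lr * grad(w, batch)
    return w


# Hypothetical toy data: the probe batch follows y = 3x.
probe = [(1.0, 3.0), (2.0, 6.0)]

# Two made-up training histories, both starting from w = 0.
history_a = [[(1.0, 3.0)], [(2.0, 6.0)]]    # earlier batches agree with the probe
history_b = [[(1.0, -3.0)], [(2.0, -6.0)]]  # earlier batches pull w the other way

w_a = sgd(0.0, history_a)
w_b = sgd(0.0, history_b)

# Same batch, different weights, therefore different gradients.
g_a = grad(w_a, probe)
g_b = grad(w_b, probe)

print("weight after history A:", w_a, "gradient on probe:", g_a)
print("weight after history B:", w_b, "gradient on probe:", g_b)
```

The probe batch is identical in both runs, yet the gradient it produces is much larger after history B, because the gradient is evaluated at whatever weights the earlier batches left behind. Whether anything like this explains loss spikes in a real large-scale run is a separate question; the sketch only shows the dependence on history.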