If I speak Japanese at you (and you don't speak Japanese), I will just confuse you. If instead you spend two years learning Japanese and I then share some information with you in Japanese, you will learn something new and become more knowledgeable.
That is a good analogy. It gets sharper once you realise that in the human case the confusion is temporary and the data is simply rejected. In the LLM the data is forced into the matrix in the wrong context, so it is actively harmful.
"In this work, we argue that the training loss instabilities observed in large-scale training should be
associated with the time-domain correlation between the gradient estimates of earlier layers in the
deep-learning models. Based on the identified connection, we propose several ways to mitigate the
instabilities, along with the heuristic method that was known in the literature. We conclude that at
this point, there is no silver bullet to solve the problem, and the appropriate remedy depends on the
specific setup of the large-scale training run."
This may be naive, but the gradient seen during one training batch would depend not only on the content of that batch but also on the outcome of all previous batches (or so I suppose). If that is so, then whether one of these spikes occurs is a function not only of the batch but also of the sequence of prior batches.
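To make that concrete, here is a toy sketch (my own illustration, not anything from the paper): plain SGD on a one-parameter model y = w * x with squared loss. The function names, the learning rate and the three tiny batches are all made up for the example. The same final batch yields a different gradient depending on the order in which the earlier batches were seen, because the gradient is evaluated at whatever weights the earlier updates left behind.

```python
def batch_gradient(w, batch):
    """Mean gradient of the squared error (w*x - y)**2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def gradient_on_final_batch(prior_batches, final_batch, w0=0.0, lr=0.1):
    """Run plain SGD over prior_batches, then return the gradient on final_batch.

    w0 and lr are arbitrary illustrative values, not tuned for anything.
    """
    w = w0
    for batch in prior_batches:
        w -= lr * batch_gradient(w, batch)
    return batch_gradient(w, final_batch)


# Hypothetical toy data: each batch is a list of (x, y) pairs.
batch_a = [(1.0, 2.0)]
batch_b = [(2.0, 1.0)]
batch_c = [(1.0, 1.0)]

# Same final batch, same prior batches, different order.
g_ab = gradient_on_final_batch([batch_a, batch_b], batch_c)
g_ba = gradient_on_final_batch([batch_b, batch_a], batch_c)

print(f"gradient on C after A, B: {g_ab:.4f}")
print(f"gradient on C after B, A: {g_ba:.4f}")
```

With these numbers the two gradients on batch C differ by roughly a factor of two in magnitude. A real network has far more parameters plus optimizer state such as momentum, so this only shows the basic dependence on history and says nothing about when an actual loss spike would happen.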