Hacker News
by mannykannot 1069 days ago
This may be naive, but the gradient seen during one training batch would depend not only on the content of that batch, but also on the state of the weights when the batch arrives, which is the outcome of all previous batches (or so I suppose). If that is so, then whether one of these spikes occurs is a function not only of the batch, but also of the sequence of prior batches.
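
To make that concrete, here is a toy sketch (my own illustration, not anything from the linked article): a one-parameter model `y_hat = w * x` trained with plain SGD on mean squared error. The names `batch_gradient`, `train`, `batch_a`, `batch_b` and `probe`, the data, and the learning rate are all made up for the example. The same `probe` batch is evaluated after the two earlier batches have been seen in either order.

```python
def batch_gradient(w, batch):
    """Gradient of the mean squared error w.r.t. w for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(w, batches, lr=0.05):
    """Apply one plain SGD step per batch, in the order given, and return w."""
    for batch in batches:
        w -= lr * batch_gradient(w, batch)
    return w


# Hypothetical data: each batch is a list of (x, y) pairs.
batch_a = [(1.0, 2.0), (2.0, 4.0)]      # pulls w toward 2
batch_b = [(1.0, -3.0), (3.0, -9.0)]    # pulls w toward -3
probe = [(1.0, 1.0), (2.0, 2.0)]        # the batch whose gradient we inspect

# Same starting weight, same batches, different order.
w_ab = train(0.0, [batch_a, batch_b])
w_ba = train(0.0, [batch_b, batch_a])

# Gradient of the identical probe batch under the two histories.
grad_ab = batch_gradient(w_ab, probe)
grad_ba = batch_gradient(w_ba, probe)

print("after a then b:", w_ab, grad_ab)
print("after b then a:", w_ba, grad_ba)
```

The two printed gradients differ even though `probe` is identical in both runs, because the weight it is evaluated at differs. A real network is far more complicated, but the dependence on history is the same in kind.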