What is classic about "skip updating parameters with high gradient/loss variance in multiple batches/samples"? Do you have a particular algorithm in mind that uses this heuristic?
Theres been multiple papers discussing how only updating parameters that have high agreement in update direction leads to less overfitting and better generalization. Lemme see if I can find em.
https://arxiv.org/abs/2411.16085 - set updates to 0 where theres disagreement in the sign of the parameter update - got accepted!
https://arxiv.org/pdf/2412.18052 - discard gradient updates from batches/minibatches that disagree where disagree means cosine distance threshold (they solved for 0.97 or something being optimal)
https://arxiv.org/abs/2411.16085 - set updates to 0 where theres disagreement in the sign of the parameter update - got accepted!
https://arxiv.org/pdf/2412.18052 - discard gradient updates from batches/minibatches that disagree where disagree means cosine distance threshold (they solved for 0.97 or something being optimal)