by yorwba 48 days ago
What is classic about "skip updating parameters with high gradient/loss variance in multiple batches/samples"? Do you have a particular algorithm in mind that uses this heuristic?
1 comments

There have been multiple papers discussing how only updating parameters that have high agreement in update direction leads to less overfitting and better generalization. Let me see if I can find them.

https://arxiv.org/abs/2411.16085 - sets updates to 0 wherever there's disagreement in the sign of the parameter update. It got accepted!
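
To make the sign-masking idea concrete, here's a rough sketch of what that could look like. This is not the paper's code: `mask_disagreeing_updates` is a made-up name, and I'm assuming "disagreement" means the proposed update and some reference gradient (e.g. from the current batch) have opposite signs for a given parameter.

```python
def mask_disagreeing_updates(update, reference):
    """Zero every update component whose sign differs from the reference.

    Assumption (not taken from the paper): a component "agrees" only when
    update[i] * reference[i] is strictly positive, so zeros count as
    disagreement and get masked too.
    """
    if len(update) != len(reference):
        raise ValueError("update and reference must have the same length")
    return [u if u * r > 0 else 0.0 for u, r in zip(update, reference)]


# Hypothetical values: the second component disagrees in sign,
# the third has a zero reference, so both are masked.
proposed = [0.5, -0.2, 0.1]
current_grad = [1.0, 0.3, 0.0]
masked = mask_disagreeing_updates(proposed, current_grad)
```

Only the parameters where both directions line up get touched; everything else sits out that step.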

https://arxiv.org/pdf/2412.18052 - discards gradient updates from batches/minibatches that disagree, where "disagree" means their cosine distance exceeds a threshold (they found something like 0.97 to be optimal)
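
A minimal sketch of the cosine-distance filter, again not the paper's implementation. The names `cosine_distance` and `filtered_average` are hypothetical, I'm only comparing two minibatch gradients, and the 0.97 default is just the number I remember from above, so treat it as a placeholder.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero gradient has no direction, so treat it as orthogonal.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def filtered_average(grad_a, grad_b, max_distance=0.97):
    """Average two minibatch gradients, or return None to skip the update.

    Assumption: the gradients "disagree" when their cosine distance is
    strictly greater than max_distance.
    """
    if cosine_distance(grad_a, grad_b) > max_distance:
        return None  # gradients disagree: discard this update
    return [(x + y) / 2 for x, y in zip(grad_a, grad_b)]


# Hypothetical gradients: identical directions are averaged,
# opposite directions are discarded.
agree = filtered_average([1.0, 0.0], [1.0, 0.0])
disagree = filtered_average([1.0, 0.0], [-1.0, 0.0])
```

Note that with a cutoff that high, only near-orthogonal or opposing gradients get thrown away, so most updates still go through.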