
There have been multiple papers discussing how only updating parameters that have high agreement in update direction leads to less overfitting and better generalization. Lemme see if I can find 'em.

https://arxiv.org/abs/2411.16085 - set updates to 0 where there's disagreement in the sign of the parameter update - got accepted!
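
Rough sketch of the sign-masking idea in plain Python, not the paper's actual code. I'm assuming "disagreement" is measured per parameter between the proposed update and the current gradient; the function name and the toy numbers are made up for illustration.

```python
def mask_by_sign_agreement(update, grad):
    """Zero out update components whose sign disagrees with the gradient.

    Assumption: agreement means update[i] * grad[i] > 0. A zero product
    (either value is exactly 0) is treated as disagreement here.
    """
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]


# Toy values, purely illustrative.
update = [0.5, -0.2, 0.1, -0.4]
grad = [1.0, 0.3, -0.7, -0.1]

masked = mask_by_sign_agreement(update, grad)
print(masked)  # [0.5, 0.0, 0.0, -0.4]
```

Only the first and last components survive, since those are the ones where update and gradient point the same way.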

https://arxiv.org/pdf/2412.18052 - discard gradient updates from batches/minibatches that disagree, where "disagree" means exceeding a cosine distance threshold (they found 0.97 or something to be optimal)
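
Rough sketch of the cosine-distance filtering idea, again plain Python and not the paper's code. Assumptions on my part: cosine distance is 1 minus cosine similarity, each minibatch gradient is compared against the running average of the gradients kept so far, and the first gradient is always kept. The 0.97 default is just the number I half-remember above; the function names are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def filter_and_average(grads, threshold=0.97):
    """Average only the gradients that agree with the running average.

    A gradient is kept when its cosine distance to the current average
    is <= threshold; otherwise it is discarded.
    """
    kept = [grads[0]]          # assumption: first gradient is always kept
    avg = list(grads[0])
    for g in grads[1:]:
        if cosine_distance(avg, g) <= threshold:
            kept.append(g)
            avg = [sum(col) / len(kept) for col in zip(*kept)]
    return avg, len(kept)


# Toy minibatch gradients: the third points the opposite way from the rest.
grads = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
avg, n_kept = filter_and_average(grads)
print(avg, n_kept)  # [1.0, 0.5] 2
```

The third gradient has a cosine distance of about 1.89 from the running average, so it gets thrown out and the update is built from the first two only.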