by minimaltom
38 days ago
There have been multiple papers discussing how only updating parameters that have high agreement in update direction leads to less overfitting and better generalization. Lemme see if I can find 'em.

https://arxiv.org/abs/2411.16085 - set updates to 0 where there's disagreement in the sign of the parameter update - got accepted!

https://arxiv.org/pdf/2412.18052 - discard gradient updates from batches/minibatches that disagree, where "disagree" means exceeding a cosine distance threshold (they solved for 0.97 or something being optimal)
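For anyone who wants to poke at the two ideas, here's a rough pure-Python sketch. It is my own toy reading of them, not code from either paper: the function names are made up, the sign check compares a proposed update against the current gradient (an assumption about what "disagreement" means), and the cosine filter compares each minibatch gradient against a running average of the ones kept so far (also an assumption). The threshold is left as a parameter rather than hardcoded.

```python
import math

def mask_by_sign_agreement(update, grad):
    """Zero out entries of `update` whose sign disagrees with `grad`.

    Hypothetical helper: an entry counts as agreeing only when the
    product of the two values is strictly positive.
    """
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: a zero vector agrees with nothing
    return 1.0 - dot / (norm_a * norm_b)

def filter_by_agreement(grads, max_distance):
    """Average minibatch gradients, skipping ones that disagree.

    The first gradient is always kept. Each later gradient is kept only
    if its cosine distance to the running average of kept gradients is
    at most `max_distance`. Returns (average, number_kept).
    """
    kept = [grads[0]]
    avg = list(grads[0])
    for g in grads[1:]:
        if cosine_distance(avg, g) <= max_distance:
            kept.append(g)
            n = len(kept)
            avg = [sum(col) / n for col in zip(*kept)]
    return avg, len(kept)
```

Usage would be something like `filter_by_agreement(minibatch_grads, max_distance=0.97)` on flattened gradients, then applying `mask_by_sign_agreement` to whatever update your optimizer proposes. Note the filter here is order-dependent, since the first gradient seeds the running average.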