by brandonb
98 days ago
Very cool. I learned something new about why EMA (exponential moving average) is needed:

> EMA-based training dynamics like JEPA’s don’t optimize any smooth mathematical function, yet they provably converge to useful, non-collapsed representations.

All the papers say EMA avoids “representation collapse” without justifying it. Didn’t realize there were any theoretical results here.
|
EMA helps because the EMA target changes more slowly than the learning network, which prevents rapid collapse by forcing the predictions to align with what a historical average of the network would predict. This is a harder and more informative task: the model can't trivially output one value and have it match the EMA target, so it learns more useful representations.
EMA has a long history in deep learning (many GANs use it, as do TD-learning methods like DQN, many JEPA papers, etc.), so authors often omit any defense of it, either from over-familiarity or sometimes cargo culting. :)
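
For anyone who hasn't seen it written out, here is a minimal sketch of the "slowly changing target" idea above. It assumes parameters are plain lists of floats; `ema_update`, `decay`, and the toy numbers are made up for illustration and are not taken from any JEPA codebase.

```python
def ema_update(target, online, decay=0.99):
    """Return new target params: decay * target + (1 - decay) * online."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


# Hypothetical toy run: the online weights sit far from the target,
# and the target only drifts toward them a little on each step.
target = [0.0, 0.0]
online = [1.0, -1.0]
history = []
for step in range(3):
    target = ema_update(target, online, decay=0.9)
    history.append(target)
```

With `decay=0.9` the target closes only 10% of the remaining gap to the online weights per step, so a sudden change in the online network shows up in the target gradually rather than all at once.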