|
|
|
|
|
by yobbo
507 days ago
|
|
For example, if it is meaningful to use large batch sizes, the gradient variance will be lower and adam could be equivalent to just momentum. As a model is trained, the gradient variance typically falls. Those optimizers all work to reduce the variance of the updates in various ways. |
|