by sdenton4
808 days ago
There's a particular parameter (epsilon) in Adam that is typically set to a bad default, which causes instability when the gradient gets sufficiently small. It is far easier to set epsilon to 0.001 or so than to muck around with learning rate schedules... Here's another person on Stack Overflow who figured this out: https://stackoverflow.com/a/44844544 PyTorch and TF both use a default of 1e-8.
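To make the epsilon point concrete, here is a rough sketch of a single scalar Adam update in plain Python. It follows the standard textbook update rule, not the actual PyTorch or TF code, and the names `adam_step` and `first_update` are made up for illustration. The thing to notice is that the step is roughly `lr * g / (|g| + eps)`, so with a tiny epsilon the step stays near the full learning rate no matter how small the gradient is.

```python
import math


def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a single scalar parameter.

    Returns (update, new_m, new_v); the caller subtracts `update` from the
    parameter. Illustrative only, not any library's implementation.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v


def first_update(grad, eps):
    """Size of the very first Adam step for a given gradient and epsilon."""
    update, _, _ = adam_step(grad, m=0.0, v=0.0, t=1, eps=eps)
    return update


if __name__ == "__main__":
    for grad in (1.0, 1e-3, 1e-6):
        for eps in (1e-8, 1e-3):
            print(f"grad={grad:g} eps={eps:g} step={first_update(grad, eps):.3e}")
```

With eps=1e-8, a gradient of 1e-6 still produces a step close to the full learning rate, so any noise in a near-zero gradient gets blown up to full-size updates. With eps=1e-3 the same gradient gives a step about a thousand times smaller, while the step for a gradient of 1.0 barely changes.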
Sounds like a variable epsilon is optimal, either instead of a variable learning rate or together with one. It would be nice if this could somehow be regulated algorithmically in a generic way.