Hacker News
by zingelshuher 804 days ago
There are unstable cases where a static learning rate doesn't work: the solution starts wobbling too much after some time and explodes. Using too small an LR from the beginning leads to local minima. Making it stable _is_ possible, but that's a different story.

There's a particular parameter (epsilon) in Adam which is typically set to a bad default, and that default causes instability when the gradient gets sufficiently small. It is far easier to set epsilon to 0.001 or so than to muck around with learning rate schedules...
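To make the epsilon point concrete, here is a minimal sketch of the scalar Adam update in plain Python (not taken from any library; the names `adam_step` and `relative_step` are made up for illustration). It feeds a constant gradient, which is a simplifying assumption, and reports how large the step is relative to the learning rate.

```python
import math

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update. Returns (delta, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v

def relative_step(grad, eps, steps=100, lr=1e-3):
    """|delta| / lr after `steps` updates with a constant gradient (assumption)."""
    m = v = 0.0
    delta = 0.0
    for t in range(1, steps + 1):
        delta, m, v = adam_step(grad, m, v, t, lr=lr, eps=eps)
    return abs(delta) / lr

if __name__ == "__main__":
    for eps in (1e-8, 1e-3):
        for grad in (1.0, 1e-6):
            print(f"eps={eps:g} grad={grad:g} step/lr={relative_step(grad, eps):.4f}")
```

With eps=1e-8 a gradient of 1e-6 still produces a step of nearly the full learning rate, so near a minimum the optimizer keeps taking full-size steps on what may be mostly noise. With eps=1e-3 the same gradient yields a step about a thousand times smaller, while a gradient of 1.0 is almost unaffected.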

Here's another person on Stack Overflow who figured this out: https://stackoverflow.com/a/44844544

PyTorch and TF both use a default of 1e-8.

"the bigger you make epsilon [...] thus slower the training progress will be"

Sounds like a variable epsilon would be optimal, either instead of a variable learning rate or together with one. It would be nice if this could somehow be regulated algorithmically in a generic way.

The training slowdown is not really a problem... There's a pretty wide range of robust, good-enough values that don't slow things down much at all. As with all optimizer cruft, the 'optimal' value is going to be problem-dependent and a pain in the butt to actually find. So it's best to find a good-enough value that works in most contexts and not worry about it.
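A rough way to see how much a larger epsilon costs: if the gradient magnitude g is roughly steady, the bias-corrected Adam moments settle at m_hat = g and v_hat = g^2, so the step is the learning rate scaled by g / (g + eps). The sketch below just tabulates that factor; the steady-gradient assumption and the name `steady_state_scale` are mine, for illustration only.

```python
def steady_state_scale(grad_mag, eps):
    """Fraction of the learning rate Adam actually applies when the
    gradient magnitude is steady at grad_mag (simplifying assumption)."""
    return grad_mag / (grad_mag + eps)

if __name__ == "__main__":
    for eps in (1e-8, 1e-3):
        for g in (1e-1, 1e-2, 1e-3, 1e-4):
            print(f"eps={eps:g} |g|={g:g} scale={steady_state_scale(g, eps):.4f}")
```

With eps=1e-3 the factor is about 0.99 at |g|=0.1 and 0.91 at |g|=0.01, and it only drops to 0.5 once the gradient is as small as epsilon itself. So the slowdown is mild while gradients are well above epsilon, and it kicks in where damping is arguably wanted anyway.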
I hope someone's submitted a PR!
There used to be a note about it in the TensorFlow docs; they're keeping the bad default because it's the default, and changing it could unexpectedly change behavior for lots of users.