| HN Mirror

by sdenton4 808 days ago

There's a particular parameter in Adam (epsilon) whose usual default is a bad choice: it causes instability once the gradient gets sufficiently small. It is far easier to set epsilon to 0.001 or so than to muck around with learning rate schedules...

Here's another person on Stack Overflow who figured this out: https://stackoverflow.com/a/44844544

PyTorch and TF both use a default of 1e-8.
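
To make the effect concrete, here is a minimal sketch of the standard Adam update in plain Python. It assumes a constant scalar gradient, which is a toy setup and not something from the linked answer; the helper name `adam_step_size` and the chosen gradient values are made up for illustration.

```python
import math

def adam_step_size(grad, eps, lr=1e-3, beta1=0.9, beta2=0.999, steps=100):
    """Magnitude of the Adam update after `steps` steps of a constant gradient.

    Hypothetical helper for illustration only; real training has noisy,
    changing gradients.
    """
    m = v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** steps)                # bias correction
    v_hat = v / (1 - beta2 ** steps)
    return abs(lr * m_hat / (math.sqrt(v_hat) + eps))

# Compare a "normal" gradient with a tiny one under both epsilon settings.
for grad in (1.0, 1e-6):
    for eps in (1e-8, 1e-3):
        print(f"grad={grad:g} eps={eps:g} step={adam_step_size(grad, eps):.3e}")
```

With a constant gradient g the step works out to lr * |g| / (|g| + eps). With eps = 1e-8, a gradient of 1e-6 still produces a step of almost the full learning rate, so gradients that are essentially noise keep moving the weights at full speed. With eps = 1e-3 that step shrinks by roughly a factor of a thousand, while the step for a gradient of 1.0 barely changes. In PyTorch the equivalent knob is the `eps` argument of `torch.optim.Adam`.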

zingelshuher 807 days ago

"the bigger you make epsilon ... thus slower the training progress will be"

Sounds like a variable epsilon would be optimal, either instead of a variable learning rate or together with one. It would be nice if this could somehow be regulated algorithmically in a generic way.

sdenton4 806 days ago

The training slowdown is not really a problem... There's a pretty wide range of robust, good-enough values that don't slow things down much at all. As with all optimizer cruft, the 'optimal' value is going to be problem-dependent and a pain in the butt to actually find. So it's best to find a good-enough value that works in most contexts and not worry about it.

knightoffaith 808 days ago

I hope someone's submitted a PR!

sdenton4 808 days ago

There used to be a note about it in the TensorFlow docs; they're keeping the bad default because it's the default, and changing it could unexpectedly change behavior for lots of users.
