HN Mirror


	by zingelshuher 804 days ago
	There are unstable cases where a static learning rate doesn't work: the solution starts wobbling too much after some time and explodes. Using too small an LR from the beginning leads to local minima. Making it stable _is_ possible, but that's a different story.

sdenton4 804 days ago

There's a particular parameter (epsilon) in Adam that is typically set to a bad default, which causes instability when the gradient gets sufficiently small. It is far easier to set epsilon to 0.001 or so than to muck around with learning rate schedules...

Here's another person on Stack Overflow who figured this out: https://stackoverflow.com/a/44844544

PyTorch and TF both use a default of 1e-8.
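
To make the mechanism concrete, here is a rough sketch of a scalar Adam update in plain Python. It is not taken from any library; the function name `adam_step_size` and the constant-gradient setup are assumptions for illustration only. It shows that with a tiny gradient and a tiny epsilon the step stays close to the full learning rate, while a larger epsilon shrinks the step along with the gradient.

```python
import math

def adam_step_size(grad, eps, lr=1e-3, beta1=0.9, beta2=0.999, steps=100):
    """Magnitude of the last Adam update for a constant scalar gradient.

    Hypothetical helper for illustration; it follows the standard Adam
    update rule with bias correction.
    """
    m = v = 0.0
    update = 0.0
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return abs(update)

tiny_grad = 1e-6  # assumed "sufficiently small" gradient

# eps far below |grad|: the ratio is close to 1, so the step is roughly lr.
step_default = adam_step_size(tiny_grad, eps=1e-8)

# eps far above |grad|: the step is roughly lr * grad / eps, so it shrinks.
step_large = adam_step_size(tiny_grad, eps=1e-3)

print(step_default, step_large)
```

In practice you would just pass the larger value to the optimizer, e.g. `eps=1e-3` in `torch.optim.Adam` or `epsilon=1e-3` in `tf.keras.optimizers.Adam`; whether 1e-3 is right for a given problem is something to check empirically.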

zingelshuher 803 days ago

"the bigger you make epsilon" ... "thus slower the training progress will be"

Sounds like a variable epsilon would be optimal, either instead of a variable learning rate or together with one. It would be nice if this could somehow be regulated algorithmically in a generic way.

sdenton4 802 days ago

The training slowdown is not really a problem... There's a pretty wide range of robust, good-enough values that don't slow things down much at all. As with all optimizer cruft, the 'optimal' value is going to be problem-dependent and a pain in the butt to actually find. So it's best to find a good-enough value that works in most contexts and not worry about it.

knightoffaith 804 days ago

I hope someone's submitted a PR!

sdenton4 804 days ago

There used to be a note about it in the TensorFlow docs; they're keeping the bad default because it's the default, and changing it would potentially change behavior unexpectedly for lots of users.
