by johndough 548 days ago
I tried it to train a CNN-based CIFAR-10 classifier, which worked well (only a tiny bit worse than Adam, and the difference might go away with hyperparameter tuning). However, the optimizer failed completely (loss diverged to infinity) when training a U-Net for an image segmentation task. I had to increase eps to 1e-4 and decrease lr to 1e-3 to keep it from exploding, but that made it very slow to converge.

My summary is that the memory savings might be great if it works, but it does not work everywhere.
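
To illustrate why raising eps and lowering lr tames the blow-up but also slows things down, here is a rough sketch. It assumes the optimizer divides by an Adam-style `sqrt(v) + eps` term, which I have not verified for this optimizer. The function name and all the numbers other than eps=1e-4 and lr=1e-3 (including the "default" lr=1e-2 and eps=1e-8) are made up for illustration.

```python
import math


def adam_style_step(grad, second_moment, lr, eps):
    """Size of one Adam-style update: lr * grad / (sqrt(v) + eps).

    Hypothetical helper for illustration; not taken from any real optimizer.
    """
    return lr * grad / (math.sqrt(second_moment) + eps)


# Assumed scenario: the second-moment estimate v is far too small
# compared to the current gradient, so the denominator nearly vanishes.
grad = 1e-3
v = 1e-16

# Assumed "default" settings (not stated in the comment above).
default_step = adam_style_step(grad, v, lr=1e-2, eps=1e-8)

# Settings from the comment above: eps raised to 1e-4, lr lowered to 1e-3.
damped_step = adam_style_step(grad, v, lr=1e-3, eps=1e-4)

print(f"default step: {default_step:.4g}")
print(f"damped step:  {damped_step:.4g}")
```

With these made-up numbers the default settings give a step in the hundreds, while the damped settings give a step of about 0.01. The larger eps caps the step at roughly lr * grad / eps whenever sqrt(v) is small, which prevents the explosion but also shrinks every update where sqrt(v) is well below eps. That would be consistent with the slow convergence.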

2 comments

Yeah, I mean that's the rub with SGD... you need to spend a non-trivial compute budget on hyperparameter tuning, and the tuned result sometimes beats Adam.

Adam, on the other hand, generally gets you pretty good results without futzing too much with hyperparameters.

Ah, numerical instability in the warmup stage might be the issue then?