by eden-u4 548 days ago
Tried the source code on a toy model: Adam took 2 epochs to train a 10k-parameter model, while this didn't achieve anything useful in 20.

Tweaked the hyperparameters a bit and such, but nothing. Probably a bogus implementation?

4 comments

I tried it for training a CNN-based CIFAR10 classifier, which worked well (only a tiny bit worse than Adam, and the difference might go away with hyperparameter tuning), but the optimizer totally failed (loss -> infinity) when training a U-Net for an image segmentation task. I had to increase eps to 1e-4 and decrease lr to 1e-3 so it would not explode, but that made it very slow to converge.

My summary is that the memory savings might be great if it works, but it does not work everywhere.

Yah I mean that's the rub with SGD... you need to spend a non-trivial compute budget on hyperparam tuning, which sometimes beats Adam.

Adam, on the other hand, generally gets you pretty good results without futzing too much with hyper params.
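For anyone who wants to poke at the tuning-sensitivity point themselves, here is a minimal sketch of a learning-rate sweep in plain Python. Everything in it is an assumption for illustration: the toy problem is a hypothetical ill-conditioned quadratic, the optimizers are textbook SGD with momentum and textbook Adam (not the paper's optimizer), and the learning-rate grid is arbitrary. A toy like this says nothing definitive about real models.

```python
import math

# Hypothetical toy problem: loss = 0.5 * (1 * x^2 + 100 * y^2)
CURVATURES = (1.0, 100.0)

def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURES, w))

def grad(w):
    return [c * x for c, x in zip(CURVATURES, w)]

def run_sgd_momentum(lr, steps=200, beta=0.9):
    """Textbook SGD with momentum; returns final loss, or inf if it diverged."""
    w, buf = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        buf = [beta * b + gi for b, gi in zip(buf, g)]
        w = [x - lr * b for x, b in zip(w, buf)]
        if not math.isfinite(loss(w)):
            return math.inf
    return loss(w)

def run_adam(lr, steps=200, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam with bias correction; returns final loss, or inf if it diverged."""
    w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        g = grad(w)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        w = [x - lr * mh / (math.sqrt(vh) + eps)
             for x, mh, vh in zip(w, m_hat, v_hat)]
        if not math.isfinite(loss(w)):
            return math.inf
    return loss(w)

# Arbitrary learning-rate grid, same grid for both optimizers.
LRS = (1e-3, 1e-2, 1e-1, 1.0)
sweep = {lr: (run_sgd_momentum(lr), run_adam(lr)) for lr in LRS}
for lr, (sgd_loss, adam_loss) in sweep.items():
    print(f"lr={lr:g}  sgd+momentum={sgd_loss:.3g}  adam={adam_loss:.3g}")
```

Swapping in a real model and the optimizer under discussion would need the actual implementation; the point here is only the shape of the comparison, with each optimizer given the same grid.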

ah, numerical instability in the warmup stage might be the issue then?
More likely a bogus paper: neither their mathematical reasoning nor their experiments seem to hold up if you look at them closely.
A single main conference publication at a top AI conference has ROI in the millions for the first author. I watched someone in the middle of their undergrad with a single ACL workshop publication get a 150K offer starting. It’s remarkable that anything real at all is published given how perverse the incentives are to blatantly make shit up.
Did you set them to use the same memory budget? Adam holds more state.

They do say it consistently matches or outperforms Adam despite its simplicity, and I think that statement refers to their approach running at its lower memory budget. But if the method is at least promising, a fair comparison would take advantage of the lower memory requirement and add more params to their version.
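Rough numbers for the equal-memory idea, as a back-of-the-envelope sketch. Assumptions, none of which come from the paper: weights, gradients, and optimizer state are all fp32; Adam keeps two buffers per parameter (first and second moment estimates); the compared method keeps a single momentum buffer like SGD with momentum; activation memory is ignored.

```python
BYTES_FP32 = 4

ADAM_BUFFERS = 2      # first and second moment estimates per parameter
MOMENTUM_BUFFERS = 1  # assumed: one momentum buffer per parameter

def training_bytes(n_params, state_buffers):
    """Weights + gradients + optimizer state, all fp32 (assumption)."""
    return n_params * BYTES_FP32 * (2 + state_buffers)

def max_params(budget_bytes, state_buffers):
    """Largest parameter count that fits in the given budget."""
    return budget_bytes // (BYTES_FP32 * (2 + state_buffers))

n = 10_000                                   # the toy model size from the top comment
budget = training_bytes(n, ADAM_BUFFERS)     # memory Adam needs for that model
same_budget_params = max_params(budget, MOMENTUM_BUFFERS)

print(f"Adam budget for {n} params: {budget} bytes")
print(f"Params that fit in the same budget with one buffer: {same_budget_params}")
```

Under these assumptions the single-buffer optimizer could afford about a third more parameters in the same budget, which is the comparison I'd want to see.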

Also the paper says slow initial convergence, under limitations:

> Moreover, our methods ensure a steady and stable update during training, allowing the model to converge better in a given task with sufficient training steps. Thus, we might observe that the convergence speed is relatively lower than Adam’s in the early stage of training; as our primary focus is to investigate the effectiveness of the SaI approach, we left the acceleration of convergence speed in future work.

Was the toy model a transformer?

Maybe it's just way too small, you wouldn't use Karatsuba multiplication to do 3*5.

that's a wrong simile given that you would get the same end result in both cases.

I'm not using a transformer, just a plain feedforward network with ReLU and dropout for a simple classifier.

I don't know, I could be wrong. I hope some toy experiment shows that it works as well as Adam even at low parameter counts.