I tried it for training a CNN-based CIFAR10 classifier, which worked well (only a tiny bit worse than Adam, and the difference might go away with hyperparameter tuning), but the optimizer totally failed (loss -> infinity) when training a U-Net for an image segmentation task. I had to increase eps to 1e-4 and decrease lr to 1e-3 so it would not explode, but that made it very slow to converge.
My summary is that the memory savings might be great if it works, but it does not work everywhere.
A single main-conference publication at a top AI conference has an ROI in the millions for the first author. I watched someone in the middle of their undergrad with a single ACL workshop publication get a 150K starting offer. It’s remarkable that anything real at all is published given how perverse the incentives are to blatantly make shit up.
Did you set them to use the same memory budget? Adam holds more state.
They do say it consistently matches or outperforms despite its simplicity, and I think that statement applies at the lower memory budget of their approach. But if it is at least promising, a fair comparison would take advantage of the lower memory requirement to add more params to their version.
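To put a rough number on the equal-budget idea, here is a back-of-the-envelope sketch. It assumes fp32 everywhere, ignores activation memory, and assumes the new optimizer keeps one state buffer per parameter (like SGD with momentum) versus Adam's two; that last point is my assumption, not something confirmed in this thread. The function names and the 100M parameter count are made up for illustration.

```python
def training_memory_bytes(n_params, state_buffers, bytes_per_value=4):
    """Bytes for weights + gradients + optimizer state (activations ignored)."""
    copies = 2 + state_buffers  # weights, gradients, then optimizer buffers
    return n_params * copies * bytes_per_value


def params_within_budget(budget_bytes, state_buffers, bytes_per_value=4):
    """Largest parameter count whose weights, grads and state fit the budget."""
    copies = 2 + state_buffers
    return budget_bytes // (copies * bytes_per_value)


ADAM_BUFFERS = 2      # first and second moment estimates per parameter
MOMENTUM_BUFFERS = 1  # assumption: a single momentum buffer per parameter

n_adam = 100_000_000  # hypothetical model size trained with Adam
budget = training_memory_bytes(n_adam, ADAM_BUFFERS)
n_larger = params_within_budget(budget, MOMENTUM_BUFFERS)

print(budget)    # 1600000000 bytes for the Adam run
print(n_larger)  # 133333333 params fit in the same budget
```

Under those assumptions the lighter optimizer could train a model about a third larger in the same memory, which is the comparison I would like to see. Activations often dominate in practice, so the real gain would be smaller.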
Also, the paper mentions slow initial convergence under its limitations:
> Moreover, our methods ensure a steady and stable update during training, allowing the model to converge better in a given task with sufficient training steps. Thus, we might observe that the convergence speed is relatively lower than Adam’s in the early stage of training; as our primary focus is to investigate the effectiveness of the SaI approach, we left the acceleration of convergence speed in future work.