| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by markisus 252 days ago

The authors somewhat address your questions in the accompanying paper https://arxiv.org/abs/2410.24206

> We emphasize that the central flow is a theoretical tool for understanding optimizer behavior, not a practical optimization method. In practice, maintaining an exponential moving average of the iterates (e.g., Morales-Brotons et al., 2024) is likely a computational feasible way to estimate the optimizer’s time-averaged trajectory.

They analyze the behavior of RMSProp (Adam without momentum) using their framework to come up with simplified mathematical models that are able to predict actual training behavior in experiments. It looks like their mathematical models explain why RMSProp works, in a way that is more satisfying than the usual hand waving explanations.

1 comments

DoctorOetker 252 days ago

Yes, it certainly provides a lot more clarity than the handwaving.

While momentum seems to work, and the authors clearly state it is not intended as a practical optimization method, I can't exclude that we can improve convergence rates by building on this knowledge.

Is it guaranteed for the oscillating behavior to have a period of 2 steps? or is say 3 step period also possible (a vector in a plane could alternately point to 0 degrees, 120 degrees and 240 degrees).

The way I read this presentation the implication seems to be that its always a period of 2. Perhaps if the top-2 sharpnesses are degenerate (identical), a period of N distinct from 2 could be possible?

It makes you wonder what if instead of storing momentum with exponential moving average one were to use the average of the last 2 iterations, so there would be less lag.

It also makes me wonder if we should perform 2 iterative steps PER sequence so that the single-sample-sequence gives feedback along it's valley instead of across it. One would go through the corpus at half the speed, but convergence may be more accurate.

kingstnap 251 days ago

I doubt it would ever be period 3.

Not a formal proof, but there is this fun theorem called period 3 implies chaos that my gut instinct says applies here.

Basically if you have a continuous mapping from [a,b] -> [a,b] and there exists a 3 cycle then that implies every other cycle length exists.

Which in this case would kinda say that if you are bouncing between three values on the y axis (and the bouncing is a continuous function which admittedly the gradient of a relu is not) you are probably in a chaotic system

Now that requires assuming that the behaviour of y is largely a function of just y. But their derivation seems to imply that it is the case.