Not to be pedantic, but I don't know if approximating the Hessian using the gradient counts as a second-order method. I was talking about "full-blown" second-order methods where you compute the Hessian through AD.

Furthermore, I don't think that by "moments of the gradients" they actually mean second derivatives.
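To make that concrete, here is a rough sketch of what I understand the second-moment estimate to be: a bias-corrected running average of squared gradients. The function name `adam_second_moment` and the toy objective f(x) = 0.5 * a * x^2 are my own choices for illustration, not anything taken from the paper's code.

```python
def adam_second_moment(grads, beta2=0.999):
    """Bias-corrected exponential moving average of squared gradients.

    Hypothetical helper for illustration; it only ever sees first derivatives.
    """
    if not grads:
        raise ValueError("need at least one gradient")
    v = 0.0
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
    # Undo the bias from initializing v at zero.
    return v / (1.0 - beta2 ** len(grads))


# Toy objective (an assumption for this example): f(x) = 0.5 * a * x**2
a, x = 3.0, 2.0
grad = a * x      # first derivative at x
hessian = a       # second derivative, constant for a quadratic

# Feed the same gradient repeatedly, as if we stayed at the same point.
v_hat = adam_second_moment([grad] * 100)
```

With a constant gradient the estimate settles at grad squared (36 here), while the actual second derivative is 3. So at least in this toy case the "second moment" is a statistic of the gradient, not a curvature estimate.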

Also from the paper: "We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions..."

It's written right in the abstract that the authors consider it a first-order method.