What kind of work are you referring to when you say higher-order SGD may _now_ be feasible for deep learning?
I can only find results that try to approximate second-order information.
Not sure what you mean. The paper above claims 1000x speedups for computing second-order derivatives. I have not tested their claims, but I was speculating that such an improvement, if true, would make computing Hessians for small networks feasible. This is what I am referring to.
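
For a rough sense of what "small" would have to mean, here is a back-of-the-envelope sketch. It only counts memory for a dense Hessian stored as float32 and says nothing about the paper's method or its timings. The layer sizes and function names are made-up examples, not taken from the paper.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network (hypothetical helper)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def dense_hessian_bytes(n_params, bytes_per_entry=4):
    """Memory for a dense n x n Hessian; assumes float32 (4 bytes per entry)."""
    return n_params * n_params * bytes_per_entry


# Assumed example architectures, chosen only for illustration.
tiny = mlp_param_count([10, 16, 1])      # 193 parameters
small = mlp_param_count([784, 32, 10])   # 25,450 parameters

print(tiny, dense_hessian_bytes(tiny))    # 193 148996
print(small, dense_hessian_bytes(small))  # 25450 2590810000
```

So even a 25k-parameter network needs roughly 2.6 GB just to store the full Hessian, because the entry count grows with the square of the parameter count. Faster second-order derivatives would help with compute, but storage alone keeps full Hessians limited to fairly small networks.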