Hacker News
by kaelan123 753 days ago
Those are valid points! Hessian-free (HF) optimization is a really nice method, but as you say it remains costly, so people don't use it. The key idea in this paper is that if you can solve linear systems faster using an analog device, the cost of an HF-like method comes down, so the method can become competitive.
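
To make the cost argument concrete, here is a minimal sketch of one HF-style step on a toy problem. It is not the paper's code or setup: the L2-regularized logistic regression, the synthetic data, and every name (`gradient`, `hessian_vector_product`, `conjugate_gradient`, `lam`) are assumptions chosen for illustration. The step solves H d = -g with conjugate gradient, using only Hessian-vector products and never forming H. That linear solve is where most of the cost goes, and it is the part a faster solver would be meant to replace.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, X, y, lam):
    # Gradient of the mean logistic loss plus (lam / 2) * ||w||^2.
    g = [lam * wi for wi in w]
    for xi, yi in zip(X, y):
        err = sigmoid(dot(xi, w)) - yi
        for j, xij in enumerate(xi):
            g[j] += err * xij / len(X)
    return g

def hessian_vector_product(w, v, X, lam):
    # Computes H @ v directly; the Hessian matrix is never built.
    hv = [lam * vi for vi in v]
    for xi in X:
        p = sigmoid(dot(xi, w))
        coeff = p * (1.0 - p) * dot(xi, v) / len(X)
        for j, xij in enumerate(xi):
            hv[j] += coeff * xij
    return hv

def conjugate_gradient(matvec, b, max_iters=50, tol=1e-16):
    # Solves A x = b for symmetric positive-definite A, given only v -> A v.
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x (x starts at zero)
    d = list(b)          # search direction
    rs = dot(r, r)
    for _ in range(max_iters):
        if rs < tol:     # squared residual norm is small enough
            break
        Ad = matvec(d)
        alpha = rs / dot(d, Ad)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rs_new = dot(r, r)
        d = [ri + (rs_new / rs) * di for ri, di in zip(r, d)]
        rs = rs_new
    return x

# Synthetic batch of 32 examples with 5 features (illustrative only).
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(32)]
y = [float(random.random() < 0.5) for _ in range(32)]
w = [0.0] * 5
lam = 1e-2  # damping / L2 term keeps the Hessian positive definite

g = gradient(w, X, y, lam)
step = conjugate_gradient(
    lambda v: hessian_vector_product(w, v, X, lam),
    [-gi for gi in g],
)
w_new = [wi + si for wi, si in zip(w, step)]
```

Each CG iteration costs one Hessian-vector product, which is roughly comparable to an extra gradient evaluation, so the inner solve multiplies the per-step cost by the number of iterations.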

About the noise: it is true that second-order information will be noisier than the gradient for a given batch size (and a lot of published HF results use impractically large batch sizes). In the paper we use relatively small batch sizes (e.g. 32 for the fine-tuning example) and show that you can still get an advantage from second-order information. Of course, it would be interesting to study in more detail how noisy second-order information can be, and on more datasets.
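
For anyone who wants to poke at the noise question themselves, here is one rough way to measure it. This is a sketch, not the paper's experiment: it assumes a toy logistic regression on synthetic data, and all names are made up for the example. It compares mini-batch estimates of the gradient and of a Hessian-vector product against their full-data values at several batch sizes. The numbers it prints describe this toy only and say nothing about the models in the paper.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient(w, rows):
    # Mean logistic-loss gradient over the given (features, label) rows.
    g = [0.0] * len(w)
    for x, label in rows:
        err = sigmoid(dot(x, w)) - label
        for j, xj in enumerate(x):
            g[j] += err * xj / len(rows)
    return g

def batch_hvp(w, v, rows):
    # Mean Hessian-vector product over the given rows.
    hv = [0.0] * len(w)
    for x, _ in rows:
        p = sigmoid(dot(x, w))
        coeff = p * (1.0 - p) * dot(x, v) / len(rows)
        for j, xj in enumerate(x):
            hv[j] += coeff * xj
    return hv

def mean_relative_error(estimator, data, batch_size, trials, rng):
    # Average of ||estimate(batch) - estimate(all data)|| / ||estimate(all data)||.
    reference = estimator(data)
    ref_norm = math.sqrt(dot(reference, reference))
    total = 0.0
    for _ in range(trials):
        estimate = estimator(rng.sample(data, batch_size))
        diff = [e - r for e, r in zip(estimate, reference)]
        total += math.sqrt(dot(diff, diff)) / ref_norm
    return total / trials

rng = random.Random(0)
data = [
    ([rng.gauss(0.0, 1.0) for _ in range(5)], float(rng.random() < 0.5))
    for _ in range(512)
]
w = [rng.gauss(0.0, 1.0) for _ in range(5)]  # arbitrary point in weight space
v = [rng.gauss(0.0, 1.0) for _ in range(5)]  # arbitrary probe direction

for batch_size in (8, 32, 128):
    grad_err = mean_relative_error(
        lambda rows: batch_gradient(w, rows), data, batch_size, 200, rng)
    hvp_err = mean_relative_error(
        lambda rows: batch_hvp(w, v, rows), data, batch_size, 200, rng)
    print(batch_size, round(grad_err, 3), round(hvp_err, 3))
```

The relative error depends heavily on where `w` sits and on the probe direction `v`, so a serious comparison would average over many of both.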