by vlovich123
448 days ago
> My understanding is that the proposed method is faster in the sense of sampling efficiency (of the cost function to construct the Taylor series), but not in the sense of FLOPS. The higher derivatives do not come for free.

Sure, but as long as computing them remains cheaper than the iterations it saves, this would still be a net win. For example, the article talks about how AI training uses gradient descent, and I'm pretty sure the gradient descent part is a tiny fraction of the training time compared with evaluating the math kernels in all the layers; therefore taking fewer steps should be a substantial win.
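A toy sketch of the trade-off being argued here, not taken from the article: it minimizes the made-up 1-D function f(x) = exp(x) - 2x with plain gradient descent and with Newton's method (which uses the second derivative), and counts the steps each one needs. The function, the learning rate of 0.1, the tolerance, and the helper `count_steps` are all assumptions chosen for illustration.

```python
import math

# Hypothetical test function f(x) = exp(x) - 2x, minimized at x = ln(2).
def grad(x):
    return math.exp(x) - 2.0   # first derivative of f

def hess(x):
    return math.exp(x)         # second derivative of f

def count_steps(update, x0=0.0, tol=1e-8, max_steps=10_000):
    """Apply `update` until |f'(x)| < tol; return (steps taken, final x)."""
    x = x0
    for step in range(max_steps):
        if abs(grad(x)) < tol:
            return step, x
        x = update(x)
    return max_steps, x

# Gradient descent: one gradient evaluation per step (learning rate assumed).
gd_steps, gd_x = count_steps(lambda x: x - 0.1 * grad(x))

# Newton's method: one gradient plus one second-derivative evaluation per step.
newton_steps, newton_x = count_steps(lambda x: x - grad(x) / hess(x))

print(f"gradient descent: {gd_steps} steps, x = {gd_x:.8f}")
print(f"Newton's method:  {newton_steps} steps, x = {newton_x:.8f}")
```

Newton's method needs far fewer steps here, but each step also pays for a second derivative. In one dimension that is nearly free; for a model with many parameters the higher-order terms are much more expensive, so whether fewer steps is a net win depends on that per-step cost, which is the point in dispute.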
|
Unfortunately not.