Hacker News new | ask | show | jobs
by wsmoses 1956 days ago
I think in essence what PartiallyTyped is trying to say is that one potential optimization opportunity in whole-program AD is that you can avoid having to cache the original inputs of the program if you know that derivative computation won't need it (e.g. its only used in a sum and not a product or something whose derivative depends on the value). Some ML frameworks must cache all of the inputs to an operation since they don't know whether it will be necessary for the reverse pass of an operation. You could go even further and decide to cache a different & smaller set of intermediate values that still lets you compute the gradient.

Beyond cache reduction, in our paper we demonstrate a lot of interesting ways that combining AD with a compiler can create potential speed-up. For example, we are often able to dead-code eliminate part of the original forward-pass code since it's not needed to compute the gradient.

We also have a cool example in the paper showing an asymptotic [O(N^2) => O(N)] speedup on a code for normalizing a vector because doing AD in the compiler lets Enzyme run after optimization (and in that example benefit from loop invariant code motion in a way that tools that aren't in the compiler cannot do).

1 comments

I think it was just a poorly stated description of the algebra of the derivative - I got the rest of it.
Yeah my best guess at that is that they were trying to say you'd only need to store one value: the sum, rather than the two individual values -- but I'm not completely sure.
That was my attempt.