It's a bit beyond me too, but here's my take from skimming: It's already known that you can do differential calculus in some pretty abstract spaces (not just functions from R^n to R^m). Automatic differentiation is a way of automatically keeping differentiability and computing derivatives when sticking other differentiable stuff together. This puts those ideas together, so you can automatically differentiate some more interesting things stuck together in some interesting ways related to programming. That was what I got from my rusty semi-layperson's skim, at least.
Karpathy's explanation in CS231n is great: for any computational unit that is a function, you can calculate a derivative. So propagating the derivative backward through the function, just aplly the chain rule. This is much with simpler functions, so understand your algorithm down to the smallest computable units in the graph.
Does this only work for mathematical programs?
I don't understand how this would work for more general computations, involving loops, writes to files, etc.
This paper isn't really a great introduction into the concept of AD. That said, one way to visualize it is to see how a basic AD implementation works. There are generally two different varieties, operator overloading and source transformation. Let's focus on the first.
Say we have a function:
double myfun(std::vector <double> x) {
auto y = double(0.);
for(int i=0;i<=5;i++)
y += x[0]*x[1];
return y;
}
We may want to find the gradient of myfun, or simply the derivative of myfun with respect to x[0] and x[1]. What we're going to do is reply double with a special AD type, which requires us to code things like:
template <typename Real>
Real myfun(std::vector <Real> x) {
auto y = Real(0.);
for(int i=0;i<=5;i++)
y += x[0]*x[1];
return y;
}
Now, instead of using a double, we'll set Real to be our AD type. In C++, this'll be something like a class where all of the AD action occurs. What's in this class depends on the particular AD algorithm, which depends on the what derivatives we want and what kind of recomputations of the derivative we want to make. In general, though, it consists of some kind of tree. At its most basic level, the nodes of this tree consist of the types variable, constant, unary function, or binary function and they contain information that relates to the chain rule. When we want to compute a derivative, we traverse the tree in a very particular order to compute and accumulate the derivative information. How this is done can vary as well, which has implications for the efficiency of the algorithm. Most implementations use a tape that predefines the order of the tree traversal. It's also possible to do a topological sort to solve for a tree traversal. Anyway, traversing a tree can be expensive for a really, really large computational tree, so the goal is to only touch every necessary node once.
Returning to your original question, finding the derivative of a program involves building up this tree of computation. Control structures don't really matter in this context because we'll just keep adding to the tree regardless of loops or conditionals. This, by the way, can be really, really expensive since the computational tree can be huge for a long running loop. There's some game in determining how and when to collapse the tree in certain ways. Source code transformation tools can just leave the original, or really modified, control structures in place, which alleviates a lot of this expense, which is partially why they're faster than operator overloading.
Thanks for your explanation. So what this really does is enable one to train code using machine learning techniques such as gradient descent, traversing a code-space instead of traversing a space of numbers?