Hacker News new | ask | show | jobs
by rd11235 1028 days ago
Anyone who believes that this completes their understanding of automatic differentiation is tricking themselves.

When your graph is a TREE, then everything is very simple, as in this post.

When your graph is instead a more general directed acyclic graph (e.g., x = 5; y = 2x; z = xy), then the IMPLEMENTATION is still very simple, but understanding WHY that implementation works is not as simple (repeat: if you think it’s ‘just the ordinary chain rule’, you are tricking yourself).

One of the earliest descriptions of this was by Paul Werbos. He called the required rule “the chain rule for ordered derivatives”, which he proved by induction from the ordinary chain rule. But it is nevertheless not immediately evident from the ordinary chain rule.

I welcome anyone who believes otherwise to prove me wrong. If you do I will be very happy.

3 comments

Then, where to read more about this? People who built autograd and other frameworks like Pytorch, mxnet, etc. should have learnt them in details somewhere. Where? AFAIK mxnet came out of academia (probably CMU).
Here's what you do: you watch this video of Andrew Karpathy [1] called "Becoming a backprop ninja". Then you pick up a function that you like and implement this backprop (which is a different way of saying reverse mode automatic differentiation) using just numpy. If you use some numpy broadcasting, an np.sum, some for-loops, you'll start getting a good feel for what's going on.

Then you can go and read this fabulous blog post [2], and if you like what you see, you go to the framework built by its author, called Small Pebble [3]. Despite the name, it's not all that small. If you peruse the code you'll get some appreciation of what it takes to build a solid autodiff library, and if push comes to shove, you'll be able to build one yourself.

[1] https://www.youtube.com/watch?v=q8SA3rM6ckI

[2] https://sidsite.com/posts/autodiff/

[3] https://github.com/sradc/SmallPebble

I don't have a great answer. Most modern descriptions are shallow and/or unclear. My favorite discussions were actually in Werbos's original papers.

A nice overview was Backpropagation through time: what it does and how to do it, 1990. The rule itself is stated very clearly there, but without proof. The proof can be found in Maximizing long-term gas industry profits in two minutes in lotus using neural network methods, 1989 (which I believe was copied over from his earlier thesis, which I could never find a copy of).

To be honest, I never get what people want with all that business and wonder if it is because the abstraction ("ordered derivatives") implied is not ideal.

If we follow the ordinary chain rule (for a single coordinate if you want) through the edges of the computational (DAG) graph, we get the right thing in each step.

The only other rule you need is that "if you use one variable several times in a calculation (i.e. several edges from(fw)/to(bw) the same node), you need to add the gradients computed for each", but IMHO that is pretty basic and intuitive, too. (So if you plug in z for both x and y into f(x, y), you have d/dz f(z, z) = f_x(z, z) + f_y(z, z), where the subscript indicates partial derivative.)

To me this seems both mathematically simpler than mixing the two into a "more than chain rule" thing and closer to what is actually going on algorithmically in a given implementation (the one I'm most familiar with is probably PyTorch's).

But the chain rule for ordered derivatives is exactly the backprop rule. It's just the mathematical representation of 'the simple implementation' I mentioned.

I think what you're saying is that you find the process intuitive. I don't have much of a way to argue with that. But I think it's important to note that we're dealing with two things: 1. a process that we follow (backprop), 2. a true answer that is obtainable using only the chain rule. And yes it turns out that (1) and (2) both give the same answer. But (2) requires much more work, and I question anyone who claims that (1) is 'obvious' from (2): getting (1) from (2) requires work.

I'm guessing you'll agree that using only the chain rule takes much more work, but in case you don't: consider a fully connected graph with at least 5 variables, say a = 5; b = 2 a; c = 2 a b; d = 2 a b c; e = 2 a b c d. If you use backprop, you can compute de/da rapidly. If you use only the chain rule, it will take a long time to compute de/da, because the number of terms you have to deal with increases exponentially fast with the number of variables.

chain rule is defined for partial derivatives, so it's still technically just chain rule
OC's point is that the chain rule for partial derivatives shouldn't be assumed because the ordinary chain rule holds, there's more depth to it than that, and the proof is harder than you might instinctively expect based on the ordinary chain rule.

It's epistemically acceptable to understand these both as "the chain rule" once we're satisfied they've both been proved, and apply liberal amounts of synecdoche from there (and I don't think OC disagrees with you on that).

Actually by 'ordinary chain rule' I am referring to what you're referring to as 'the chain rule for partial derivatives'. It seems like backprop follows very quickly even from that, but it does not.
> chain rule is defined for partial derivatives

I agree. That's what I'm referring to as 'the ordinary chain rule'.

> so it's still technically just chain rule

No. Go try to derive backprop for general DAGs using only the chain rule. If you complete the proof, then you will agree that the proof was more elaborate than you ever expected.