|
|
|
|
|
by t-vi
1035 days ago
|
|
To be honest, I never get what people want with all that business and wonder if it is because the abstraction ("ordered derivatives") implied is not ideal. If we follow the ordinary chain rule (for a single coordinate if you want) through the edges of the computational (DAG) graph, we get the right thing in each step. The only other rule you need is that "if you use one variable several times in a calculation (i.e. several edges from(fw)/to(bw) the same node), you need to add the gradients computed for each", but IMHO that is pretty basic and intuitive, too. (So if you plug in z for both x and y into f(x, y), you have d/dz f(z, z) = f_x(z, z) + f_y(z, z), where the subscript indicates partial derivative.) To me this seems both mathematically simpler than mixing the two into a "more than chain rule" thing and closer to what is actually going on algorithmically in a given implementation (the one I'm most familiar with is probably PyTorch's). |
|
I think what you're saying is that you find the process intuitive. I don't have much of a way to argue with that. But I think it's important to note that we're dealing with two things: 1. a process that we follow (backprop), 2. a true answer that is obtainable using only the chain rule. And yes it turns out that (1) and (2) both give the same answer. But (2) requires much more work, and I question anyone who claims that (1) is 'obvious' from (2): getting (1) from (2) requires work.
I'm guessing you'll agree that using only the chain rule takes much more work, but in case you don't: consider a fully connected graph with at least 5 variables, say a = 5; b = 2 a; c = 2 a b; d = 2 a b c; e = 2 a b c d. If you use backprop, you can compute de/da rapidly. If you use only the chain rule, it will take a long time to compute de/da, because the number of terms you have to deal with increases exponentially fast with the number of variables.