Well, from my perspective, there are three different components to machine learning techniques like multilayer perceptrons. Basically,

1. What's the model?
2. How do we define the error between the model and the data?
3. What algorithm do we use to fit the model to the data?

In the case of a multilayer perceptron with a single hidden layer, the model is

    m(alpha,W,b,x) = alpha'*sigmoidal(Wx + b)

where

    alpha : nhidden x 1
    W     : nhidden x ninput
    b     : nhidden
    '     denotes transpose

and sigmoidal is some kind of sigmoidal function like a logistic function or arc tangent.

As far as defining the error, we tend to use the sum of squared errors since it's differentiable and easy:

    J(alpha,W,b) = sum_i (m(alpha,W,b,x[i]) - y[i])^2

Finally, we need to apply an optimization algorithm to the problem

    min_{alpha,W,b} J(alpha,W,b)

Most of the time, people use a nonglobalized steepest descent. Honestly, this is terrible. A matrix-free inexact-Newton method globalized with either a line-search or trust-region method will work better.

Anyway, all good optimization algorithms require the derivative of J above. Certainly, we can grind through these derivatives by hand if we're masochistic. Well, for one layer it's not all that bad. However, as we start to nest things, it becomes a terrible pain to derive. Rather than do this, we can just use automatic differentiation to find the gradient and Hessian-vector products of J. In fact, most AD codes already do this. I can never find a paper that's run through the details, but back propagation is basically just a reverse mode AD algorithm applied to J in order to find the gradient.
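To make the back propagation point a bit more concrete, here is a minimal sketch in plain Python of the single-hidden-layer model and sum-of-squares error above, with the gradient written out as a hand-rolled reverse sweep and checked against a finite difference. This is only an illustration under assumptions: the sigmoidal is taken to be the logistic function, the helper names (logistic, model, loss, grad, fd_alpha) and the tiny data set are made up for the example, and a real AD tool would generate the backward pass rather than have you write it.

```python
import math

def logistic(z):
    # Assumed choice of sigmoidal; arc tangent would work as well.
    return 1.0 / (1.0 + math.exp(-z))

def model(alpha, W, b, x):
    # m(alpha,W,b,x) = alpha' * sigmoidal(W x + b)
    # Returns the output and the hidden activations h (kept for the reverse sweep).
    h = [logistic(sum(w * xj for w, xj in zip(row, x)) + bk)
         for row, bk in zip(W, b)]
    return sum(a * hk for a, hk in zip(alpha, h)), h

def loss(alpha, W, b, xs, ys):
    # J(alpha,W,b) = sum_i (m(alpha,W,b,x[i]) - y[i])^2
    return sum((model(alpha, W, b, x)[0] - y) ** 2 for x, y in zip(xs, ys))

def grad(alpha, W, b, xs, ys):
    # Reverse sweep: run the model forward, then push dJ/dm back through
    # each operation with the chain rule. This is back propagation.
    g_alpha = [0.0] * len(alpha)
    g_W = [[0.0] * len(row) for row in W]
    g_b = [0.0] * len(b)
    for x, y in zip(xs, ys):
        m, h = model(alpha, W, b, x)
        dm = 2.0 * (m - y)                              # dJ/dm for this sample
        for k in range(len(alpha)):
            g_alpha[k] += dm * h[k]                     # m = sum_k alpha[k]*h[k]
            dz = dm * alpha[k] * h[k] * (1.0 - h[k])    # through the logistic
            g_b[k] += dz
            for j in range(len(x)):
                g_W[k][j] += dz * x[j]
    return g_alpha, g_W, g_b

def fd_alpha(alpha, W, b, xs, ys, k, eps=1e-6):
    # Central finite difference in alpha[k], used only as a sanity check.
    up = list(alpha); up[k] += eps
    dn = list(alpha); dn[k] -= eps
    return (loss(up, W, b, xs, ys) - loss(dn, W, b, xs, ys)) / (2.0 * eps)

# Made-up example data: nhidden = 2, ninput = 2, two samples.
alpha = [0.5, -0.3]
W = [[0.1, 0.2], [-0.4, 0.3]]
b = [0.0, 0.1]
xs = [[1.0, 2.0], [0.5, -1.0]]
ys = [0.3, -0.2]

g_alpha, g_W, g_b = grad(alpha, W, b, xs, ys)
print("reverse sweep:", g_alpha[0], "finite difference:", fd_alpha(alpha, W, b, xs, ys, 0))
```

The reverse sweep costs roughly one extra pass over the model per sample no matter how many parameters there are, whereas the finite difference needs two loss evaluations per parameter. That's the reason reverse mode is the natural fit for the gradient of J.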
Given an arbitrary program that has been differentiated, what data would the machine learning algorithm optimize against?
Is the goal just a faster program, or something else?
For any changes, you would need some kind of perfect, 100% coverage regression test proving that the optimized program is still correct and handles all cases, because the differentiation only recorded one possible path through the program.