| HN Mirror

Alright, so machine learning is a pretty broad term that encompasses a lot of things. Very specifically above, I was talking about things I call curve fitters, which include multilayer perceptrons. There's also classifiers, which often people create by use a curve fitter and then thresholding it. There's other strange techniques as well that people call machine learning that don't fit into these categories as well.

Anyway, for curve fitters, we literally need a set of points {(x_i,y_i)}_{i=1}^m given to us and our goal is to map the n-dimensional vectors x_i to the scalar output y_i. If we want a multidimensional output, we just do this more than once. Now, theoretically, we can try and match any function, or program, to this data. Let phi be this model, so that we want phi(alpha,x)=y. Here, alpha are the parameters we want to tune for the matching. Also note, these parameters can't be arbitrary. We really need them to be some kind of floating point number. Now, how we do the matching can vary, but sum of squared errors is easy, so we solve the optimization problem

min_{alpha} sum_i (phi(alpha,x_i)-y_i)^2

Basically, we want to find a parameter alpha so that the output from program phi matches our data {(x_i,y_i)}_{i=1}^m as closely as possible in a least-squares sense. Since all good optimization algorithms need a derivative, we use automatic differentiation to find the gradient, and probably Hessian-vector product, of

J(alpha) = sum_i (phi(alpha,x_i)-y_i)^2

Now, if phi is a program, then so is J. It's a program that takes alpha as an input, runs a for-loop to sum up the squared errors, and then it returns the result.

Alright, that was long winded, but the key is this. We're not optimizing a program when we do this. Rather, the optimization occurs when we minimize the difference between our inputs and our outputs, which were given to us ahead of time. In this context, we have two programs. One, the program that we're trying to fit to our data. Two, the program that determines the amount of misfit between our inputs and our outputs. We need the derivative of the composition of these two programs, which is why we use automatic differentiation.