And why it took a long time for back propagation to be introduced into machine learning.
Back propagation is (almost) just a fancy word for differentiation: you take the derivative of the error in the output, measured against your training data, with respect to the weights.
As someone who's starting to learn a bit about machine learning, it feels like the whole field is full of fancy terms like this that seem to mostly map to simpler or more familiar ones. "linear regression" instead of fitting a line, "hyperparameter" instead of user-provided argument. Half the battle seems to be building this mental translation map.
You are looking at it from a programmer standpoint rather than a mathematical standpoint.
Linear regression isn't just fitting a line; it's a statistical technique for finding the line of best fit. Hyperparameters are a Bayesian term for parameters outside the system of test or "algorithm". "User input" really misses the Bayesian aspect.
These terms actually have meaning, so I'd be careful about ascribing simpler definitions to them. The underlying meaning is important to the reason they work. If you don't have a really strong background in probability theory and statistics, trying to dig into machine learning will take work. I'd recommend taking an MITx course or picking up a textbook on probability so the terminology feels more natural.
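To make the "line of best fit" point concrete, here is a minimal sketch of ordinary least squares for a single input variable, assuming the usual criterion of minimizing the sum of squared vertical errors. The function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope that minimizes squared error is cov(x, y) / var(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The best-fit line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # made-up data lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The statistical content is in the choice of criterion: "best" here means least squared error, which is what separates regression from drawing any line through the points.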
A user-provided argument could also be an input parameter or a regular function parameter altogether.
Yes, hyperparameters are often set by the user of a model, but more specifically they are parameters that exist separately from the data put into a model (input parameters) or the structure inside of neural networks (hidden parameters). Hyper-, meaning above, helps conceptualize these parameters as existing outside the model.
Yes, backpropagation isn't the chain rule itself, but just an efficient way to apply the chain rule. (In this respect there are some connections to dynamic programming, where you find the most efficient order of recursive computations to arrive at the solution.)
I think of it as: applying the chain rule in an order such that we never need to compute Jacobians explicitly, only vector-Jacobian products.
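A rough sketch of what that ordering looks like for a tiny two-layer network, in plain Python so nothing is hidden behind a library. The network shape, weights, and function names are all made up for illustration. The backward pass only ever multiplies a gradient vector by each layer's local derivative, and a finite-difference check on one weight confirms the result.

```python
import math


def forward(W1, W2, x):
    # Hidden layer: z = W1 @ x, h = tanh(z). Output: y = W2 . h (a scalar).
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    h = [math.tanh(zi) for zi in z]
    y = sum(w * hi for w, hi in zip(W2, h))
    return h, y


def loss(W1, W2, x, target):
    _, y = forward(W1, W2, x)
    return 0.5 * (y - target) ** 2


def backward(W1, W2, x, target):
    h, y = forward(W1, W2, x)
    dy = y - target                      # dL/dy
    dW2 = [dy * hi for hi in h]          # dL/dW2
    dh = [dy * w for w in W2]            # gradient pushed back through the output layer
    # tanh has a diagonal Jacobian; we scale elementwise instead of building it.
    dz = [dhi * (1.0 - hi * hi) for dhi, hi in zip(dh, h)]
    dW1 = [[dzi * xj for xj in x] for dzi in dz]  # dL/dW1 as an outer product
    return dW1, dW2


W1 = [[0.5, -0.3], [0.1, 0.8]]
W2 = [0.7, -0.2]
x = [1.0, 2.0]
target = 0.5

dW1, dW2 = backward(W1, W2, x, target)

# Central finite difference on W1[0][1] as an independent check.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss(W1_plus, W2, x, target) - loss(W1_minus, W2, x, target)) / (2 * eps)
print(dW1[0][1], numeric)
```

Each backward step reuses the vector computed by the step before it, which is where the dynamic-programming flavor comes from.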
I also didn't totally grasp its significance until implementing neural networks from matrix/array operations in NumPy. I hope all deep learning courses include this exercise.
Yes, they are not the same. The chain rule is what solves the one non-trivial problem with backpropagation. Besides that, it's just the quite obvious idea of changing the weights in proportion to how impactful they are on the error.
Is that why it took so long? I was under the impression it was because of vanishing gradients in backprop once you stack a huge number of layers (the deep in deep neural networks).
The reverse mode has famously been re-discovered (or re-applied) many times, for example as backpropagation in ML, and as AAD in finance (to compute "Greeks", i.e. partial derivatives of the value of a product with respect to many inputs).
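For anyone who hasn't seen the effect, here is a toy sketch, assuming a stack of scalar sigmoid layers with a fixed weight of 1.0 (a deliberately simplified setup, not a real network). The sigmoid's derivative is at most 0.25, so the product of local derivatives that the chain rule produces shrinks geometrically with depth. The function name `input_gradient` is made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, x=0.5, weight=1.0):
    """d(output)/d(input) for `depth` stacked scalar layers a -> sigmoid(weight * a)."""
    a = x
    grad = 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        # Local derivative of this layer: weight * sigmoid'(z), with sigmoid'(z) = a * (1 - a).
        grad *= weight * a * (1.0 - a)
    return grad


grads = {depth: input_gradient(depth) for depth in (1, 5, 20)}
for depth, g in grads.items():
    print(depth, g)
```

How badly this bites in practice depends on the activation function and the weight scale, so treat the numbers as an illustration of the mechanism rather than a statement about any particular architecture.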
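A minimal sketch of reverse mode itself, assuming only scalar addition and multiplication. The `Var` class and `backward` function are hypothetical helpers written for this example, not any library's API. The point is that one backward sweep yields the partial derivative of a single output with respect to every input at once, which is why it suits both loss gradients and Greeks.

```python
class Var:
    """A scalar that records how it was computed (illustrative, not a library API)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    # Order nodes so that every node comes after all of its parents.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk in reverse so a node's gradient is complete before it is passed on.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


a, b, c = Var(2.0), Var(3.0), Var(4.0)
f = a * b + b * c  # f = ab + bc
backward(f)
print(a.grad, b.grad, c.grad)  # df/da = b, df/db = a + c, df/dc = b
```

Forward mode would need one pass per input to get the same three numbers; reverse mode needs one pass per output, which is the cheap direction when there are many inputs and a single scalar result.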