by colordrops 1400 days ago
A class taught like this for me was what got me to quit physics and switch to CS.

And why it took a long time for back propagation to be introduced into machine learning.

Back propagation is (almost) just a fancy word for a differential equation, with the derivative taken relative to the error in the output against your training data.

As someone who's starting to learn a bit about machine learning, it feels like the whole field is full of fancy terms like this that seem to mostly map to simpler or more familiar ones. "linear regression" instead of fitting a line, "hyperparameter" instead of user-provided argument. Half the battle seems to be building this mental translation map.
You are looking at it from a programmer standpoint rather than a mathematical standpoint.

Linear regression isn't just fitting a line, it's a statistical technique for finding the line of best fit. Hyperparameters are a Bayesian term for parameters outside the system of test or "algorithm". "User input" really misses the Bayesian aspect.
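To make "line of best fit" concrete, here is a minimal sketch, assuming "best" means ordinary least squares (the usual reading, though the comment above does not spell it out). The function name `fit_line` and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope is the sample covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

The statistical part is the choice of criterion: the line minimizes the sum of squared vertical errors, which is what separates it from eyeballing a line through the points.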

These terms actually have meaning, so I'd be careful about ascribing simpler definitions to them. The underlying meaning is important to the reason they work. If you don't have a really strong background in probability theory and statistics, digging into machine learning will take work. I'd recommend taking an MITx course or picking up a textbook on probability so the terminology feels more natural.

To be fair, "linear regression" is standard statistics 101 that much predates machine learning or computers.
A user-provided argument could also be an input parameter or a regular function parameter altogether.

Yes, hyperparameters are often set by the user of a model, but more specifically they are parameters that exist separately from the data put into a model (input parameters) or the structure inside of neural networks (hidden parameters). "Hyper-", meaning above, helps conceptualize these parameters as existing outside the model.

Actually, backpropagation is more of a fancy word for the chain rule.
ALMOST like using the chain rule

Backpropagation ≠ Chain Rule: https://theorydish.blog/2021/12/16/backpropagation-≠-chain-r...

That's just nitpicking, but ok: backpropagation is the application of the chain rule for total derivatives.

Look into forward- vs reverse-mode automatic differentiation, and you'll understand what I'm referring to.
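A rough sketch of the difference, assuming a toy function f(x1, x2) = x1 * x2 + sin(x1) picked purely for illustration. The names `Dual`, `dsin`, `f_forward` and `f_reverse` are hypothetical, not from any library. Forward mode needs one pass per input; reverse mode gets every partial derivative from a single backward sweep.

```python
import math


class Dual:
    """Forward mode: carry a value plus its derivative along one input direction."""

    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)


def dsin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)


def f_forward(x1, x2, seed1, seed2):
    """One forward pass yields the derivative along the seeded direction only."""
    a, b = Dual(x1, seed1), Dual(x2, seed2)
    return (a * b + dsin(a)).dot


def f_reverse(x1, x2):
    """One backward sweep yields the partials with respect to both inputs."""
    # Forward sweep: y = v1 + v2 with v1 = x1 * x2 and v2 = sin(x1).
    # Backward sweep: start from dy/dy = 1 and push adjoints toward the inputs.
    y_bar = 1.0
    v1_bar = y_bar
    v2_bar = y_bar
    x1_bar = v1_bar * x2 + v2_bar * math.cos(x1)
    x2_bar = v1_bar * x1
    return x1_bar, x2_bar


x1, x2 = 2.0, 3.0
# Forward mode: two passes, one per input.
forward_grad = (f_forward(x1, x2, 1.0, 0.0), f_forward(x1, x2, 0.0, 1.0))
# Reverse mode: one pass for both partials.
reverse_grad = f_reverse(x1, x2)
```

With a scalar loss and millions of weights, that "one pass per input" cost is what makes forward mode impractical and reverse mode (backpropagation) the natural choice.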

Yes, backpropagation isn't the chain rule itself, but just an efficient way to calculate the chain rule. (In this respect there are some connections to dynamic programming, where you find the most efficient order of recursive computations to arrive at the solution.)
I think of it as: computing the chain rule in the order such that we never need to compute Jacobians explicitly; only Jacobian-vector products.

I also didn't totally grasp its significance until implementing neural networks from matrix/array operations in NumPy. I hope all deep learning courses include this exercise.
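For anyone who wants to try that exercise, here is a minimal sketch of it. It uses plain Python lists instead of NumPy so it runs with no dependencies, and the network shape (two inputs, two tanh hidden units, one linear output, squared error) and all weights are arbitrary choices for illustration. Note that `backward` never builds a Jacobian matrix; it only pulls an adjoint vector back through each layer.

```python
import math


def forward(W1, w2, x):
    # Hidden pre-activations, tanh activations, then a scalar linear output.
    z = [sum(W1[i][j] * x[j] for j in range(len(x))) for i in range(len(W1))]
    h = [math.tanh(zi) for zi in z]
    y = sum(w2[i] * h[i] for i in range(len(h)))
    return h, y


def loss(W1, w2, x, target):
    _, y = forward(W1, w2, x)
    return 0.5 * (y - target) ** 2


def backward(W1, w2, x, target):
    h, y = forward(W1, w2, x)
    y_bar = y - target                      # dL/dy for the squared error
    w2_bar = [y_bar * hi for hi in h]       # dL/dw2
    h_bar = [y_bar * w for w in w2]         # adjoint pulled back through the output layer
    # tanh'(z) = 1 - tanh(z)^2, applied elementwise.
    z_bar = [hb * (1.0 - hi * hi) for hb, hi in zip(h_bar, h)]
    W1_bar = [[zb * xj for xj in x] for zb in z_bar]  # dL/dW1
    return W1_bar, w2_bar


# Arbitrary illustrative weights and data.
W1 = [[0.1, -0.2], [0.4, 0.3]]
w2 = [0.5, -0.7]
x = [1.0, 2.0]
target = 0.25
W1_bar, w2_bar = backward(W1, w2, x, target)

# Sanity check: compare one backprop gradient to a central finite difference.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss(W1_plus, w2, x, target) - loss(W1_minus, w2, x, target)) / (2 * eps)
```

The finite-difference check at the end is worth keeping in any from-scratch implementation, since a sign or index slip in `backward` is otherwise easy to miss.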

Yes, they are not the same. The chain rule is what solves the one non-trivial problem with backpropagation. Besides that, it's just the quite obvious idea of changing the weights in proportion to how impactful they are on the error.
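The "change the weights in proportion to how impactful they are" part can be sketched for a single weight. This is a toy, assuming a one-weight model y = w * x with squared error; `train`, the learning rate and the step count are illustrative choices, not anything from the thread.

```python
def train(x, target, w=0.0, learning_rate=0.1, steps=50):
    """Plain gradient descent on L(w) = 0.5 * (w * x - target)^2."""
    for _ in range(steps):
        error = w * x - target
        grad = error * x              # dL/dw: how strongly this weight affects the error
        w -= learning_rate * grad     # step against the gradient, scaled by its size
    return w


# With x = 2 and target = 6, the error is zero at w = 3.
w = train(x=2.0, target=6.0)
```

With one weight the gradient is trivial to write down. The chain rule only becomes necessary once the weight sits several layers away from the error.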
Is that why it took so long? I was under the impression it was because of diminishing (vanishing) gradients in backprop once you stack a huge number of layers (the "deep" in deep neural networks).
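The shrinking-gradient effect is easy to reproduce in a toy setting. The sketch below assumes a chain of scalar sigmoid layers that all share one weight, which is a deliberate simplification; `input_gradient` and its defaults are made up for illustration. It shows the mechanism only and says nothing about the history.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, x=0.5, weight=1.0):
    """Gradient of the last activation with respect to the input, via the chain rule."""
    activations = []
    a = x
    for _ in range(depth):
        a = sigmoid(weight * a)
        activations.append(a)
    grad = 1.0
    # Each layer contributes weight * sigmoid'(z), and sigmoid'(z) = a * (1 - a) <= 0.25.
    for a in reversed(activations):
        grad *= weight * a * (1.0 - a)
    return grad


grads = {depth: input_gradient(depth) for depth in (1, 5, 20)}
```

Since every factor is at most 0.25 when the weight is 1, the gradient at depth 20 is bounded by 0.25 ** 20, which is below 1e-12.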
Could you please point me to a resource that explains this connection?
The reverse mode has famously been re-discovered (or re-applied) many times, for example as backpropagation in ML, and as AAD in finance (to compute "Greeks", i.e. partial derivatives of the value of a product with respect to many inputs).

A few resources here:

An overview, with a bias towards finance: https://informaconnect.com/a-brief-introduction-to-automatic...

On the history: Andreas Griewank, Who Invented the Reverse Mode of Differentiation? https://ftp.gwdg.de/pub/misc/EMIS/journals/DMJDMV/vol-ismp/5...

On the history of back propagation: https://en.wikipedia.org/wiki/Backpropagation#History

The article that introduced it to finance: Michael Giles and Paul Glasserman, Smoking adjoints: fast Monte Carlo Greeks https://www0.gsb.columbia.edu/faculty/pglasserman/Other/Risk...

Survey of the application in finance: Cristian Homescu, Adjoints and Automatic (Algorithmic) Differentiation in Computational Finance https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1828503

It was in one of the fast.ai courses, I think, where Jeremy did back propagation using Excel.

https://www.fast.ai/

Maybe someone else here remembers the exact video.

Hope you don't mind me plugging my blog post, which covers chain rule -> autodiff -> training of neural networks: https://sidsite.com/posts/autodiff/
Absolutely not. Thank you for sharing.