| HN Mirror



	by cl3misch 299 days ago
	I actually see this a lot: confusing backpropagation with gradient descent (or any optimizer). Backprop is just a way to compute the gradient of the cost function with respect to the weights, not an algorithm to minimize the cost function wrt. the weights. I guess the fancy name "backpropagation" for the (mathematically) simple principle of computing a gradient with the chain rule comes from the early days of AI, when computers were much less powerful and this seemed less obvious?
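To make the distinction concrete, here is a minimal sketch in plain Python. The two-weight model `w2 * tanh(w1 * x)`, the function names, and the learning rates are all assumptions chosen for illustration, not anything from the linked article. `backprop` only computes a gradient; the two step functions are interchangeable optimizers that consume it.

```python
import math

def forward(params, x):
    """Hypothetical tiny model: y_hat = w2 * tanh(w1 * x)."""
    w1, w2 = params
    h = math.tanh(w1 * x)
    return h, w2 * h

def cost(params, x, y):
    """Squared error for a single training pair."""
    return (forward(params, x)[1] - y) ** 2

def backprop(params, x, y):
    """Gradient of the cost with respect to (w1, w2). Updates nothing."""
    w1, w2 = params
    h, y_hat = forward(params, x)
    d_yhat = 2.0 * (y_hat - y)        # dC/dy_hat
    d_w2 = d_yhat * h                 # dC/dw2
    d_h = d_yhat * w2                 # dC/dh
    d_w1 = d_h * (1.0 - h * h) * x    # dC/dw1, using tanh'(z) = 1 - tanh(z)^2
    return (d_w1, d_w2)

def gradient_descent_step(params, grads, lr=0.1):
    """One optimizer: plain gradient descent."""
    return tuple(p - lr * g for p, g in zip(params, grads))

def momentum_step(params, grads, velocity, lr=0.1, beta=0.9):
    """A different optimizer fed by the same gradient."""
    velocity = tuple(beta * v + g for v, g in zip(velocity, grads))
    params = tuple(p - lr * v for p, v in zip(params, velocity))
    return params, velocity

params = (0.5, 0.5)
x, y = 1.0, 0.3                       # made-up training pair
before = cost(params, x, y)
for _ in range(50):
    params = gradient_descent_step(params, backprop(params, x, y))
after = cost(params, x, y)
print(before, after)
```

Swapping `gradient_descent_step` for `momentum_step` changes the minimization procedure without touching `backprop` at all, which is the point of keeping the two ideas apart.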

2 comments

imtringued 299 days ago

The German Wikipedia article makes the same mistake and it is quite infuriating.


cubefox 299 days ago

What does this comment have to do with the previous comment, which talked about supervised learning?


imtringued 299 days ago

Reread the comment

"Backprop is just a way to compute the gradient of the cost function with respect to the weights, not an algorithm to minimize the cost function wrt. the weights."

What does the word supervised mean? It's when you define a cost function to be the difference between the labels in the training data and the model output.

For example, something like (f(x)-y)^2, which is simply the squared difference between the model's output for an input x from the training data and the corresponding label y.

A learning algorithm is an algorithm that produces a model given a cost function, and in the case of supervised learning the cost function is parameterized by the training data.
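As a rough sketch of what "parameterized by the training data" can look like in code: the helper below bakes a dataset into a cost function, so what the optimizer sees is a function of the model parameters only. The names `make_cost` and `linear_model`, the averaging over examples, and the three data points are assumptions made up for the example.

```python
def make_cost(training_data, model):
    """Return a cost function of the parameters only; the data is baked in."""
    def cost(params):
        # Mean of (f(x) - y)^2 over all labeled pairs (averaging is an assumption).
        total = sum((model(params, x) - y) ** 2 for x, y in training_data)
        return total / len(training_data)
    return cost

def linear_model(params, x):
    """Hypothetical model f(x) = w * x + b."""
    w, b = params
    return w * x + b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # made-up points on y = 2x + 1
cost = make_cost(data, linear_model)

print(cost((2.0, 1.0)))   # parameters that fit the data exactly
print(cost((0.0, 0.0)))   # parameters that do not
```

Any optimizer can now be pointed at `cost` without knowing anything about labels or datasets.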

The most common way to learn a model is to use an optimization algorithm. There are many optimization algorithms that can be used for this. One of the simplest algorithms for the optimization of unconstrained non-linear functions is stochastic gradient descent.

It's popular because it is a first-order method. First-order methods use only the first partial derivatives, collected in the gradient, whose size equals the number of parameters. Second-order methods typically converge in fewer iterations, but they need the Hessian, whose size scales with the square of the number of parameters being optimized.
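A back-of-the-envelope sketch of that scaling, assuming 4 bytes per entry (float32) and a dense Hessian; both are assumptions for illustration, and the function names are made up.

```python
def gradient_entries(n_params):
    """One partial derivative per parameter."""
    return n_params

def hessian_entries(n_params):
    """One second derivative per pair of parameters (dense storage)."""
    return n_params * n_params

def size_in_bytes(entries, bytes_per_entry=4):
    """Storage needed, assuming float32 entries."""
    return entries * bytes_per_entry

for n in (1_000, 1_000_000):
    g = size_in_bytes(gradient_entries(n))
    h = size_in_bytes(hessian_entries(n))
    print(f"{n} parameters: gradient {g} bytes, Hessian {h} bytes")
```

For one million parameters that is about 4 MB for the gradient against about 4 TB for the Hessian. The Hessian is symmetric, so roughly half of the entries are distinct, but the growth stays quadratic.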

How do you calculate the gradient? Either you calculate each partial derivative individually, or you use the chain rule and work backwards to calculate the complete gradient.
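The two routes can be compared on a toy problem. In this sketch the model is a chain of scalar `tanh` layers, and "individually" is interpreted as central finite differences, one weight at a time; both choices are assumptions for illustration. The backward version gets every partial derivative from one forward pass and one backward sweep.

```python
import math

def forward(weights, x):
    """a_0 = x, a_i = tanh(w_i * a_{i-1}); returns all activations."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def cost(weights, x, y):
    return (forward(weights, x)[-1] - y) ** 2

def grad_one_by_one(weights, x, y, eps=1e-6):
    """Each partial derivative separately: two forward passes per weight."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += eps
        down = list(weights)
        down[i] -= eps
        grads.append((cost(up, x, y) - cost(down, x, y)) / (2 * eps))
    return grads

def grad_backward(weights, x, y):
    """Chain rule, working backwards from the cost to the first weight."""
    acts = forward(weights, x)
    grads = [0.0] * len(weights)
    upstream = 2.0 * (acts[-1] - y)               # dC/da_n
    for i in reversed(range(len(weights))):
        a_prev, a_out = acts[i], acts[i + 1]
        local = 1.0 - a_out * a_out               # tanh'(w_i * a_prev)
        grads[i] = upstream * local * a_prev      # dC/dw_i
        upstream = upstream * local * weights[i]  # dC/da_{i-1}
    return grads

weights = [0.5, -0.8, 1.2, 0.3]                   # made-up values
print(grad_one_by_one(weights, 0.7, 0.1))
print(grad_backward(weights, 0.7, 0.1))
```

Both functions should agree up to numerical error; the difference is cost. The one-by-one route needs work proportional to the number of weights times the cost of a forward pass, while the backward sweep costs about as much as a single forward pass.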

I hope this makes it clear that your question is exactly backwards. The referenced blog post is about backpropagation and unnecessarily mentions supervised learning. You are the one now sticking with supervised learning, even though the comment you're responding to told you exactly why it is inappropriate to call backpropagation a supervised learning algorithm.


DoctorOetker 298 days ago

Regarding "supervised": there is a small nuance.

Traditional "supervised" training required the dataset to be annotated with labels (good/bad, such-and-such a bounding box in an image, ...), which cost a lot of human labor to produce.

When people speak of "unsupervised" training, I actually consider it a misnomer: the term grew historically and will not go away quickly, but a more apt name would have been "label-free" training.

For example consider a corpus of human written text (books, blogs, ...) without additional labels (verb annotations, subject annotations, ...).

Now consider someone proposing to use next-token prediction. Clearly it doesn't require additional labeling. Is it supervised? Nobody calls it supervised under the current convention, but one may view next-token prediction on a bare text corpus as a trick to turn an unlabeled dataset into trillions of supervised prediction tasks. Given this N-gram of preceding tokens, what does the model predict as the next token? And what does the corpus actually say is the next token? Let's use this actual next token as the label, as if it were a "supervised" (labeled) exercise.
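A minimal sketch of that trick, assuming whitespace tokenization and a fixed context length (both simplifications; `next_token_examples` is a made-up name): every position in the raw text yields one (context, label) pair without any human annotation.

```python
def next_token_examples(tokens, context_size):
    """Turn an unlabeled token sequence into (context, next_token) pairs."""
    examples = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])   # the preceding N-gram
        examples.append((context, tokens[i]))         # the corpus supplies the label
    return examples

corpus = "the cat sat on the mat".split()   # toy corpus, whitespace tokens
pairs = next_token_examples(corpus, context_size=2)
for context, label in pairs:
    print(context, "->", label)
```

Each pair has the same shape as a hand-labeled example (input x, label y), so the squared-error or any other supervised cost applies unchanged.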


cubefox 298 days ago

That's also why LeCun promoted the term "self-supervised" a while ago, with some success.


cl3misch 299 days ago

The previous comment highlights an example where backprop is confused with "a supervised learning algorithm".

My comment was about "confusing backpropagation with gradient descent (or any optimizer)."

For me the connection is pretty clear? The core issue is confusing backprop with minimization. The cited article mentioning supervised learning specifically doesn't take away from that.
