| HN Mirror



	by kmmlng 895 days ago
	> This is all true in a neural net, but Transformers aren't Neural Nets in the traditional sense. I was under that impression originally, but there's not a back propagation or Hebbian learning here, which were the key bits of biomimicry that earned classic NNs their name.

	Hebbian learning has never been used with much success in training neural nets. Backpropagation is not bio-inspired, but backpropagation is certainly used to train transformers.


pmayrgundter 894 days ago

Agreed Hebbian learning isn't used.. just meant it as an example of what would signal a NN.

For Backprop, I'm basing this off the development of the Perception. Wiki supports this and its bio-inspired origin[1].

As for its use in Transformers, if you mean simple regressing of errors or use of gradient descent, I'd agree, but that's not usually called Backprop and the term isn't used in the original paper. The term typically means back propagating the errors thru the entire network at a certain stage of learning, and that's not present in Transformers that I can tell.

Happy to see any support for your claims tho.

[1] https://en.m.wikipedia.org/wiki/Backpropagation


kmmlng 894 days ago

What do you mean, the development of the "Perception"? Do you mean the Perceptron? In that case, Backprop was invented way later than the Perceptron (see https://people.idsia.ch/~juergen/who-invented-backpropagatio...).

I don't see any information in your linked Wikipedia article that supports a bio-inspired origin. In fact, researchers have been wondering whether an equivalent to Backprop might be found in biological brains, but Backprop is widely believed to be biologically implausible (see e.g. https://arxiv.org/pdf/1502.04156.pdf, https://www.sciencedirect.com/science/article/pii/S089360801...).

It's not surprising that the term Backprop is not mentioned in the original paper. It isn't mentioned in most neural network research, because it's simply the default method to optimize weights, and it's hidden away by modern autodiff frameworks, so no one actually has to give it any thought. But backprop is definitely used in transformers (see e.g. https://aclanthology.org/2020.emnlp-main.463.pdf, https://arxiv.org/pdf/2004.08249, https://proceedings.mlr.press/v202/phang23a/phang23a.pdf, https://dinkofranceschi.com/docs/bft.pdf).
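To make the "hidden away by autodiff" point concrete, here is a minimal sketch of what such a framework does under the hood, assuming a toy single-head attention over scalar tokens. The `Value` class and `attention_loss` function are hypothetical names made up for illustration, not any real library's API. One backward sweep from the loss yields gradients for the query, key and value weights together, and a finite-difference check is included as a sanity test.

```python
import math


class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __radd__ = __add__
    __rmul__ = __mul__

    def __sub__(self, other):
        return self + (other * -1.0)

    def exp(self):
        e = math.exp(self.data)
        return Value(e, (self,), (e,))

    def reciprocal(self):
        return Value(1.0 / self.data, (self,), (-1.0 / self.data ** 2,))

    def backward(self):
        # Order the graph so every node comes after the nodes it depends on,
        # then apply the chain rule from the loss back to every input.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, g in zip(v._parents, v._local_grads):
                p.grad += g * v.grad


def attention_loss(wq, wk, wv, tokens, target):
    """Toy single-head attention over scalar tokens; the last token is the query."""
    q = wq * tokens[-1]
    keys = [wk * x for x in tokens]
    values = [wv * x for x in tokens]
    exps = [(q * k).exp() for k in keys]            # unnormalised softmax weights
    norm = sum(exps).reciprocal()
    out = sum(e * norm * v for e, v in zip(exps, values))
    diff = out - target
    return diff * diff                              # squared error


def loss_at(params, tokens, target):
    leaves = [Value(p) for p in params]
    return attention_loss(*leaves, tokens, target), leaves


tokens, target = [0.5, -1.0, 2.0], 1.0   # arbitrary toy data
params = [0.3, -0.7, 1.1]                # arbitrary wq, wk, wv

loss, leaves = loss_at(params, tokens, target)
loss.backward()                          # one sweep through the whole graph
backprop_grads = [leaf.grad for leaf in leaves]

# Central finite differences as an independent check on the same gradients.
eps = 1e-6
numeric_grads = []
for i in range(len(params)):
    up, down = list(params), list(params)
    up[i] += eps
    down[i] -= eps
    delta = loss_at(up, tokens, target)[0].data - loss_at(down, tokens, target)[0].data
    numeric_grads.append(delta / (2 * eps))

print("backprop:", backprop_grads)
print("numeric: ", numeric_grads)
```

The two printed lists should agree to several decimal places. Real frameworks do the same thing over tensors instead of scalars, which is why training code rarely mentions the backward pass explicitly.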


pmayrgundter 894 days ago

Ah yes, Perceptron. Had a couple typos.. sorry, was on phone.

The bio-inspiration was via Frank Rosenblatt, who is referred to in that article, tho yeah, the history is over in his own article:

https://en.wikipedia.org/wiki/Frank_Rosenblatt#Perceptron

"Rosenblatt was best known for the Perceptron, an electronic device which was constructed in accordance with biological principles and showed an ability to learn.

He developed and extended this approach in numerous papers and a book called Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, published by Spartan Books in 1962.[6] He received international recognition for the Perceptron.

The Mark I Perceptron, which is generally recognized as a forerunner to artificial intelligence, currently resides in the Smithsonian Institution in Washington D.C."

Your Juergen page is interesting, tho no direct comment on Rosenblatt there. He does cite the work on this page:

https://people.idsia.ch/~juergen/deep-learning-overview.html (refs R58, R61)

My reading is that a long-known idea, about multivariate regression, was reinterpreted by Rosenblatt by 1958 via the bio-inspired Perceptron, and then that was criticized by Minsky and others and viable methods were achieved by 1965. When I was taught NNs by Mitchell at CMU in the 1990s (lectures similar to his book Machine Learning), this was the same basic story. Also reminds me of a moment in class one day when a Stats Prof who was surveying the course broke out with "but wait, isn't this all just multivariate regression??" :) Mitchell agreed to the functional similarity, but I think that helps highlight how the biomimicry was crucial to developing the idea. It had lain hidden in plain sight for a century.
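For reference, a minimal sketch of the perceptron error-correction rule in its simplified textbook form (not a reconstruction of Rosenblatt's hardware or his exact formulation). The function name `train_perceptron`, the AND toy data and the learning rate are assumptions chosen for illustration. Note that the update touches a single layer of weights only, so nothing is propagated back through hidden layers.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Textbook perceptron rule for two inputs and a threshold output."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in samples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = label - pred            # -1, 0 or +1
            if err != 0:
                # Nudge the weights toward the correct side of the boundary.
                w[0] += lr * err * x[0]
                w[1] += lr * err * x[1]
                b += lr * err
                mistakes += 1
        if mistakes == 0:                 # every sample classified correctly
            break
    return w, b


def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


# Logical AND: a linearly separable toy problem.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_samples)
predictions = [predict(w, b, x) for x, _ in and_samples]
print(w, b, predictions)
```

The functional form is a thresholded linear model, which is where the comparison to regression comes from; what differs is the mistake-driven update.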

Agreed, and I was aware that there has since been criticism of the biological plausibility of backprop.

Your further links with refs to backprop in transformers are interesting; I hadn't seen these. It's clear the term is being used like you say, tho I still see ambiguity in its utility here. Autodifferentiation, gradient descent, multivariate regression etc. are ofc in common use, and scanning these papers it's not clear to me the terms aren't simply being conflated. What had stood unique for me with backprop was a coherent whole-network regression. This to me looks like a piecewise approach.

But anyways, I see your point. Thanks!


pmayrgundter 894 days ago

Got me reading the original. It's rad.

Link to PDF and some screens from intro here..

https://twitter.com/PMayrgundter/status/1743096776456867921
