I appreciate the effort the authors put into this post, but this is like saying DNNs are stacked logistic regressions: the connection is superficial and doesn't lead to deep insights about how they really work.
It's not about "how they really work", but what data they operate on and what problems they can be applied to. When I first heard the term "transformer" from a friend, I didn't have any association in my mind because it's a very opaque term, but once he explained it to me as Graph Neural Networks, it very quickly clicked.
I'm genuinely a bit surprised by that; that was always my high-level understanding of the essence of neural networks (at least vanilla feedforward ones). Would you care to elaborate?
It depends on what kind of understanding you want to achieve. It can be helpful to think of a DNN as approximating its infinitely wide counterpart. Depending on how you handle the scaling, it then acts like a linear filter of the error signal in function space or, at least for single-hidden-layer networks, like an interacting particle system. In both cases you can understand the convergence of gradient descent training using these analogies, although gaps from real-world practice exist.
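To make the "linear filter of the error signal" picture a bit more concrete, here is a rough sketch of the function-space dynamics. It assumes squared loss, gradient flow (continuous-time gradient descent), and a kernel $\Theta$ that stays fixed during training, which is the usual idealization in the infinite-width limit. The symbols $f_t$, $\Theta$, $x_i$, $y_i$ are notation introduced here for illustration, not taken from the comment above.

```latex
% Squared loss over n training points (x_i, y_i):
%   L(f) = \frac{1}{2} \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2
% Under gradient flow with a fixed kernel \Theta, the prediction at any x evolves as
\[
  \partial_t f_t(x) \;=\; -\sum_{i=1}^{n} \Theta(x, x_i)\,\bigl(f_t(x_i) - y_i\bigr),
\]
% i.e. the change in f_t is a linear function of the current error f_t(x_i) - y_i.
```

Read this way, training applies a fixed linear operator to the residuals, which is why convergence can be analyzed through the spectrum of the kernel matrix on the training points.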
While they can be thought of as stacked regressions, it's only logistic regression for one particular non-linearity (the sigmoid). And for many non-linearities you'll have a hard time usefully interpreting them as a regression.
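For the sigmoid case specifically, the "stacked logistic regression" reading can be sketched in a few lines. This is a minimal illustration, not anyone's actual implementation; the function names (`logistic_regression`, `dense_sigmoid_layer`, `tiny_mlp`) are made up for this example.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(x, w, b):
    """P(y = 1 | x) for a logistic regression with weights w and bias b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def dense_sigmoid_layer(x, W, bs):
    """A dense layer with sigmoid activation: one logistic regression per unit."""
    return [logistic_regression(x, w, b) for w, b in zip(W, bs)]

def tiny_mlp(x, W1, b1, w2, b2):
    """Two 'stacked' levels: hidden logistic regressions feed a final one."""
    hidden = dense_sigmoid_layer(x, W1, b1)
    return logistic_regression(hidden, w2, b2)

if __name__ == "__main__":
    x = [1.0, -1.0]
    W1 = [[0.5, 0.5], [1.0, 0.0]]
    b1 = [0.0, 0.0]
    print(tiny_mlp(x, W1, b1, w2=[1.0, -1.0], b2=0.0))
```

Swap `sigmoid` for another activation and the per-unit "logistic regression" reading no longer applies, which is the point being made above.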
I think the common ones have statistical interpretations (that predate deep learning by a lot). Perhaps the one for the rectified linear unit is pretty obscure. But as I understand it, the statistics concept is called the "Tobit" model. Its meaning is not so obscure though: just a prediction that can only be non-negative, which is pretty common for quantities like mass or energy.
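A rough sketch of that analogy: in the standard Tobit setup, a latent linear response is recorded as zero whenever it would be negative. A ReLU unit looks like the deterministic part of that, without the noise term. The names `latent_response` and `relu_unit` below are hypothetical, chosen just for this illustration.

```python
def latent_response(x, w, b):
    """Unconstrained linear predictor, which may be negative."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def relu_unit(x, w, b):
    """Censor the latent response at zero, so the output is never negative."""
    return max(0.0, latent_response(x, w, b))

if __name__ == "__main__":
    # Latent value is -2.0 here, so the censored prediction is 0.0.
    print(relu_unit([1.0, 2.0], [1.0, 1.0], -5.0))
    # Latent value is 3.5 here, so it passes through unchanged.
    print(relu_unit([1.0, 2.0], [1.0, 1.0], 0.5))
```

The analogy is loose: a Tobit model is fitted by maximum likelihood with an explicit noise distribution, whereas a ReLU unit is just the censoring step applied inside a larger network.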
You mean like arxiv-sanity? As I understand it, it trains an SVM on papers you like and suggests papers that fall on the same side of the hyperplane. It could be used as a quality classifier by only liking high-quality work.