by cfgauss2718
818 days ago
By minimizing a loss functional with respect to a bunch of numbers that amount to entries in matrices (or tensors, whatever) using an approximate hill-climbing approach. I’m not sure what insights there are to be gained here; it doesn’t seem any more exotic or interesting to me than asking “how does the pseudoinverse of A ‘learn’ to approximately solve Ax=b?”. Maybe this seems reductive, but once you nail down what the loss functional is (often MSE loss for regression or diffusion models, cross-entropy for classification, and many others) and perhaps the particulars of the model architecture (feed-forward vs. recurrent, fully connected bits vs. convolutions, encoders/decoders), it’s unclear to me what is left for us to discover about how “learning” works beyond understanding old fundamental algorithms like Newton-Krylov for minimizing nonlinear functions (which subsumes basically all deep learning and goes well beyond). My gut tells me that the curious among you should spend more time learning the fundamentals of optimization than puzzling over some special (and probably non-existent) alchemy inherent in deep networks.
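For anyone who wants to see the pseudoinverse comparison concretely, here is a minimal sketch in Python with NumPy. It assumes a small synthetic least-squares problem; the function name `gradient_descent_least_squares`, the problem sizes, and the step-size choice are all made up for illustration. It runs plain gradient descent on the MSE loss and compares the result with what `np.linalg.pinv` gives directly.

```python
import numpy as np

def gradient_descent_least_squares(A, b, steps=5000):
    """Minimize the MSE loss ||Ax - b||^2 / n by plain gradient descent."""
    n = A.shape[0]
    x = np.zeros(A.shape[1])
    # Step size 1/L, where L = 2 * sigma_max(A)^2 / n bounds the
    # curvature of the loss (sigma_max is the largest singular value).
    lr = n / (2.0 * np.linalg.norm(A, 2) ** 2)
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ x - b) / n  # gradient of the MSE loss
        x -= lr * grad
    return x

# Hypothetical synthetic data: 50 noisy observations of a 3-parameter model.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=50)

x_gd = gradient_descent_least_squares(A, b)  # iterative "learning"
x_pinv = np.linalg.pinv(A) @ b               # closed-form least-squares solution

print("gradient descent:", x_gd)
print("pseudoinverse:   ", x_pinv)
```

In this well-conditioned linear case the two answers should agree to numerical precision, which is the sense in which the iterative procedure and the pseudoinverse are doing the same job.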
|
Asking about things like the properties of the pseudoinverse on a dataset drawn from some distribution (or even the properties of simple regression) is interesting and useful. If we could understand neural networks as well as we understand linear regression, it would be a massive breakthrough, not a boring "it's just minimizing a loss function" statement.
Hell, even if you just ask about minimizing things, you get a whole theory of M-estimators [0]. This kind of dismissive comment doesn't add anything.
[0] https://en.wikipedia.org/wiki/M-estimator
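To make the M-estimator point a bit more concrete, here is a rough Python sketch of one textbook case: a location estimate defined as the minimizer of the Huber loss, computed by iteratively reweighted averaging. The function name `huber_location`, the threshold `delta=1.0`, and the toy data are assumptions chosen for illustration. The sample mean is the same construction with squared loss, so the only thing that changes between the two estimates is which loss gets minimized.

```python
import numpy as np

def huber_location(data, delta=1.0, steps=100):
    """Location M-estimate under the Huber loss, via iteratively reweighted averaging."""
    mu = np.median(data)  # robust starting point
    for _ in range(steps):
        r = np.abs(data - mu)
        # Residuals within delta get full weight; larger ones are down-weighted
        # by delta / |r|. The np.maximum guard avoids division by zero.
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        mu = np.sum(w * data) / np.sum(w)
    return mu

# Hypothetical toy sample: five points near 1.0 and one gross outlier.
data = np.array([0.9, 1.0, 1.1, 1.05, 0.95, 50.0])

mean_estimate = data.mean()           # minimizer of squared loss
huber_estimate = huber_location(data) # minimizer of Huber loss

print("mean: ", mean_estimate)
print("huber:", huber_estimate)
```

The mean gets dragged toward the outlier while the Huber estimate stays near the bulk of the data, and working out why (and how fast each converges as the sample grows) is exactly the kind of question the theory answers.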