Hacker News
by cfgauss2718 818 days ago
By minimizing a loss functional with respect to a bunch of numbers that amount to entries in matrices (or tensors, whatever) using an approximate hill climbing approach. I’m not sure what insights there are to be gained here; it doesn’t seem more exotic or interesting to me than asking “how does the pseudo inverse of A ‘learn’ to approximate the formula Ax=b?”. Maybe this seems reductive, but once you nail down what the loss functional is (often MSE loss for regression or diffusion models, cross entropy for classification, and many others) and perhaps the particulars of the model architecture (feed-forward vs recurrent, fully connected bits vs convolutions, encoder/decoders) then it’s unclear to me what is left for us to discover about how “learning” works beyond understanding old fundamental algorithms like Newton-Krylov for minimizing nonlinear functions (which subsumes basically all deep learning and goes well beyond). My gut tells me that the curious among you should spend more time learning about fundamentals of optimization than puzzling over some special (and probably non-existent) alchemy inherent in deep networks.
9 comments

> it doesn’t seem more exotic or interesting to me than asking “how does the pseudo inverse of A ‘learn’ to approximate the formula Ax=b?”

Asking things like properties of the pseudoinverse against a dataset on some distribution (or even properties of simple regression) is interesting and useful. If we could understand neural networks as well as we understand linear regression, it would be a massive breakthrough, not a boring "it's just minimizing a loss function" statement.

Hell, even if you just ask about minimizing things, you get a whole theory of M-estimators [0]. This kind of dismissive comment doesn't add anything.

[0] https://en.wikipedia.org/wiki/M-estimator

You raise a fair point; I do think that it’s important to understand how the properties of the data manifest in the least-squares solution to Ax=b. Without that, the only insights we have are from analysis, while we would be remiss to overlook the more fundamental theory, which is linear algebra. However, my suspicion is that the answer to these same questions but applied to nonlinear function approximators is probably not much different from the insights we have already gained in more basic systems. That said, the overly broad title of the manuscript doesn’t seem to point toward those kinds of questions (specifically, things like “how do properties of the data manifold manifest in the weight tensors”) and I’m not sure that one should equate those things to “learning”.
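For concreteness, one standard way to see how the data show up in the least-squares solution is through the singular value decomposition. This is textbook notation rather than anything taken from the thread: assume A has rank r with singular values σ_i > 0 and singular vectors u_i, v_i.

```latex
A = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad
x^{+} = A^{+} b = \sum_{i=1}^{r} \frac{u_i^{\top} b}{\sigma_i} \, v_i
```

Here x^{+} is the minimum-norm minimizer of ||Ax - b||_2. Components of b along directions with small σ_i get amplified by 1/σ_i, which is one example of a property of the data manifesting directly in the solution.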
This is overly reductive. Understanding what they're doing at a higher level is useful. Even if you knew enough about neuron activations and how they change with stimulus, that wouldn't be enough for a human to develop a syllabus for teaching maths, even if they "understand how people learn".

What you describe also doesn't answer the question of how to structure and train a model, which surely is quite important. How do the choices impact real world problems?

Sure, but their title seems poorly chosen and doesn't match what they are claiming in the article itself, which includes understanding how GPT-2 makes its predictions.

How does GPT-2 learn, for example, that copying a word from way back in the context helps it to minimize the prediction error? How does it even manage to copy a word from the context to the output? We know that it is minimizing prediction errors, and learned to do so via gradient descent, but HOW is it doing it? (we've discovered a few answers, but it's still a research area)

I haven’t read the manuscript yet, and am not sure that I will. However, I don’t agree with the question. Gradient descent and the properties of the loss function are the “how”. It seems like you want to know how some properties of the data are manifested in the network itself during/after training (what these properties are doesn’t seem to be something that people know they are looking for). Maybe that’s what the authors are interested in as well. If I could bet money in Vegas on the answer to that question, my bet would be that, in most cases, the structures we probe in the network and find correlated with aspects of the problem or task that we (as humans) can recognize will very likely boil down to approximations of fundamental and eminently useful quantities like, say, approximate singular value decompositions of regions in the data manifold, or approximate eigenfunctions etc. I could see how these kinds of empirical investigations are interesting, but what would their impact be? Another guess: these investigations may lead to insights that help engineers design better architectures or incrementally improve training methods. But I think that’s about it - this type of research strikes me as engineering and application.
Outside of pure interest - how these LLMs are working, the utility/impact of understanding them would be to be able to control them - how to remove capabilities you don't want them to have (safety), or perhaps even add capabilities, or just steer their behavior in some desirable way.

Pretty much everything about NNs is engineering - it's basically an empirical technology, not one that we have much theoretical understanding of outside of the very basics.

> Pretty much everything about NNs is engineering - it's basically an empirical technology, not one that we have much theoretical understanding of outside of the very basics.

This pretty much answers the question some have asked: “why are the world’s preeminent mathematicians not working on AI if AGI will solve everything eventually anyway?”.

At least for now, the skills required to make progress in AI (machine learning as it largely is now) are those of an engineer rather than a mathematician.

> By minimizing a loss functional with respect to a bunch of numbers that amount to entries in matrices (or tensors, whatever) using an approximate hill climbing approach.

Are the rules of chess all there is to it? Is there really no more to be said?

Well, if neural nets are nothing more than their optimization problem then why isn't there a mathematical proof of this already?

And why isn't that reductionism? We don't say human learning is merely the product of millions of years of random evolution, and leave it at that. So if we take a position on a reductionist account of learning, then how do we prove it or disprove it?

Are there arguments that don't rest on our gut feelings? Otherwise this is just different expert factions arguing that "neural nets are/aren't superautocomplete / stochastic parrots" but with more technobabble.

I'm with you. My only understanding of ML is from a class in 2016 where we implemented basic ML algos, not neural nets, GPTs or whatever, but I always assumed it's not radically different.

Take a bunch of features or make up a billion features, then find a function that best predicts the greatest number of outputs correctly. Any "emergent" behavior, I imagine, is just a result of finding new features or sets of features.

I agree with your interpretation. There is something there to be learned for sure, but I’m doubtful whatever that thing is will be a breakthrough in machine learning or optimization, nor that it will come by applying the tools of analysis. The idea of “emergence” is interesting although vague and bordering on unscientific. Maybe complexity theory, graph theory, and information theory might provide some insights. But in the end, I would guess the impact of those insights will be limited to tricks that can be used to engineer marginally better architectures or marginally faster training methods.
I don't understand many of these words (the highest I got was college calculus), but this sounds interesting to me.
It's all really just basic calculus, with a couple nifty tricks layered on top:

1) Create a bunch of variables and initialize them to random values. We're going to add and multiply these variables. The specific way that they're added and multiplied doesn't matter so much, though it turns out in practice that certain "architectures" of addition and multiplication patterns are better than others. But the key point is that it's just addition and multiplication.

2) Take some input, or a bunch of numbers that convey properties of some object, say a house (think square feet, number of bedrooms, number of bathrooms, etc) and add/multiply them into the set of variables we created in step 1. Once we plug and chug through all the additions and multiplications, we get a number. This is the output. At first this number will be random, because we initialized all our variables to random numbers. Measure how far the output is from the expected value corresponding to the given inputs (say, purchase price of the house). This is the error or "loss". In the case of purchase price, we can just subtract the predicted price from the expected price (and then square it, to make the calculus easier).

3) Now, since all we're doing is adding and multiplying, it's very straightforward to set up a calculus problem that minimizes the error of the output with respect to our variables. The number of multiplication/addition steps doesn't even matter, since we have the chain rule. It turns out this is very powerful: it gives us a procedure to minimize the error of our system of variables (i.e. model), by iteratively "nudging" the variables according to how they affect the "error" of the output. The iterative nudging is what we call "learning". At the end of the procedure, rather than producing random outputs, the model will produce predictions of house prices that correlate with the distribution of input square footage, bedrooms, bathrooms, etc. that we saw in the training set.

In a sense, ML and AI are really just the next logical step of calculus once we have big data and computational capacity.
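The three steps above can be sketched in a few lines of plain Python. This is only an illustrative toy: the five houses, the made-up pricing rule (2.0 * sqft + 0.5 * beds + 1.0), the learning rate, and the function names are all assumptions, and a real model would have many more variables and layers.

```python
import random

# Hypothetical toy data: (square feet in thousands, bedrooms).
# Prices (in units of $100k) come from a made-up rule, purely for illustration.
houses = [(1.0, 2), (1.5, 3), (2.0, 3), (2.5, 4), (3.0, 5)]
prices = [2.0 * sqft + 0.5 * beds + 1.0 for sqft, beds in houses]


def predict(w, x):
    # Step 2: the output is just multiplications and additions.
    return w[0] * x[0] + w[1] * x[1] + w[2]


def mean_squared_error(w):
    # Squared difference between predicted and expected price, averaged.
    return sum((predict(w, x) - y) ** 2 for x, y in zip(houses, prices)) / len(houses)


def train(steps=20000, lr=0.04, seed=0):
    rng = random.Random(seed)
    # Step 1: variables start at random values.
    w = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, y in zip(houses, prices):
            err = predict(w, x) - y
            # Step 3: derivative of err**2 with respect to each variable (chain rule).
            grad[0] += 2.0 * err * x[0]
            grad[1] += 2.0 * err * x[1]
            grad[2] += 2.0 * err
        # "Nudge" each variable a little against its averaged gradient.
        w = [wi - lr * g / len(houses) for wi, g in zip(w, grad)]
    return w
```

Calling `train()` and then `mean_squared_error` on the result should show the error shrinking toward zero, with the learned variables approaching the coefficients of the made-up pricing rule.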

Calculus is all you need! Neural nets are trained to minimize their errors (what they actually output vs what we want them to output). When we build a neural net we know the function corresponding to the output error, so training them (finding the minimum of the error function) is done just by following the gradient (derivative) of the error function.
I think there are still open questions about this that are worth asking.

It is clear enough that following gradients of a bounded differentiable function can bring you to a local minimum of the function (unless, I guess, there’s a path that heads away from the starting location, going off to infinity, along which the function is always decreasing, asymptotically approaching some value, but this sort of situation can be prevented by adding loss terms that penalize parameters being too big).
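A minimal sketch of that runaway case, using an assumed toy function f(x) = exp(-x): it is bounded below by 0 and always decreasing, so plain gradient descent keeps drifting toward infinity. Adding an assumed penalty lam * x**2 gives a finite minimizer. The function names and constants here are illustrative only.

```python
import math


def descend(grad, x0=0.0, lr=0.1, steps=10000):
    # Plain gradient descent on a one-dimensional function.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def grad_plain(x):
    # Gradient of f(x) = exp(-x); it is never zero, so there is no minimum to reach.
    return -math.exp(-x)


def make_grad_penalized(lam):
    # Gradient of f(x) = exp(-x) + lam * x**2 (penalty on large parameters).
    def grad(x):
        return -math.exp(-x) + 2.0 * lam * x
    return grad
```

With more steps, `descend(grad_plain, steps=...)` returns an ever larger x, while `descend(make_grad_penalized(0.1))` settles at a point where the gradient is essentially zero.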

But, what determines whether it reaches a global minimum? Or, if it doesn’t reach a global minimum, what kinds of local minima are there, and what determines which kinds it is more likely to end up in? Does including momentum and stochastic stuff in the gradient descent influence the kinds of local minima that are likely to be approached? If so, in what way?

Local minima aren't normally a problem for neural nets since they usually have a very large number of parameters, meaning that the loss/error landscape has a correspondingly high number of dimensions. You might be in a local minimum in one of those dimensions, but the probability of simultaneously being in a local minimum of all of them is vanishingly small.
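A two-dimensional sketch of the point above, using the assumed toy function f(x, y) = x**2 - y**2 (not a real loss, and unbounded below). At the origin it is a minimum along x but a maximum along y, so the origin is a saddle rather than a local minimum, and any small offset in y lets gradient descent escape.

```python
def f(x, y):
    # Toy saddle: curves up along x, down along y.
    return x * x - y * y


def descend_saddle(x, y, lr=0.1, steps=100):
    # Gradient of f is (2x, -2y); step against it.
    for _ in range(steps):
        x, y = x - lr * 2.0 * x, y - lr * (-2.0 * y)
    return x, y
```

Starting exactly on the x-axis, for example `descend_saddle(1.0, 0.0)`, the iterate slides into the saddle point and stays there. Starting from `descend_saddle(1.0, 1e-6)`, the tiny y component grows each step and the function value drops below f(0, 0).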

Different learning rate schedules, as well as momentum etc., can also help avoid getting stuck for too long in areas of the loss landscape that may not be local minima but may still be slow to move out of. One more modern approach is to cycle between higher and lower learning rates rather than just use monotonically decreasing ones.
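One simple way to cycle the learning rate is a triangular schedule that ramps linearly up and then back down. The sketch below is a hand-rolled illustration; the function name, the default rates, and the cycle length are assumptions rather than any particular library's API.

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.01, half_cycle=100):
    # Position within the current up-then-down cycle.
    pos = step % (2 * half_cycle)
    if pos <= half_cycle:
        frac = pos / half_cycle                      # ramping up
    else:
        frac = (2 * half_cycle - pos) / half_cycle   # ramping down
    return base_lr + (max_lr - base_lr) * frac
```

In a training loop one would call `triangular_lr(step)` each iteration and use the result as the step size, instead of a value that only ever decreases.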

I'm not sure what the latest research is, but things like batch size and learning rate can certainly affect the minimum found, with some resulting in better generalization than others.

LLMs really get those mirror neurons firing and people tend to anthropomorphize them a bit too much.
Indeed, hopefully they can be diverted from interest in LLMs towards actual science, like the neuroscience which revealed the existence of said mirror neurons.
You are missing one important point.

Your network can learn some dataset very well. However, that doesn't say anything about how well it generalizes, and thus how useful your network is.
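A deliberately extreme sketch of that gap, under assumed toy data (y = 2x, with even x for training and odd x held out). A hypothetical "memorizer" that stores the training pairs gets zero training error yet does poorly on unseen inputs, while a one-parameter line fits both.

```python
train_set = [(x, 2.0 * x) for x in range(0, 10, 2)]  # x = 0, 2, 4, 6, 8
test_set = [(x, 2.0 * x) for x in range(1, 10, 2)]   # x = 1, 3, 5, 7, 9


def fit_memorizer(data):
    # Stores every training pair; falls back to the mean target on unseen inputs.
    table = dict(data)
    fallback = sum(y for _, y in data) / len(data)
    return lambda x: table.get(x, fallback)


def fit_line_through_origin(data):
    # Closed-form least squares for y = w * x.
    w = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
    return lambda x: w * x


def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)
```

Comparing `mse(model, train_set)` with `mse(model, test_set)` for each model shows that a perfect score on the training data alone says nothing about which of the two is useful.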

Your point is a salient one. It would be useful if we could provide guarantees/bounds on generalization, or representation power, or understand how brittle a model is to shifts in the data distributions. Are these questions of the kind that are answered in part by the authors? I haven’t read the manuscript, but the title doesn’t indicate this is the aim of the research; instead it indicates an eye to something much broader and vaguer (“learning”).
The title is bad on lots of levels, and it also doesn't match the article, still less the original paper.