by quantadev 598 days ago
> can reproduce the outputs of an implicit linear model with least squares loss after one step of gradient descent.

Makes you wonder if we're training LLMs the hard way. For example, if computers had been invented before Calculus, we'd have been using "Numerical Integration" (iterating the differential squares to sum up areas, etc) and "Numerical Differentiation" (ditto for calculating slopes).
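To make the analogy concrete, here is a minimal sketch comparing the "iterate and sum" approach with the closed form calculus gives. The function x^2, the interval [0, 3], and the helper names are all made up for illustration.

```python
def midpoint_integral(f, a, b, n):
    """Approximate the integral of f over [a, b] by summing n midpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


def f(x):
    return x * x


def antiderivative(x):
    # Closed form from calculus: the antiderivative of x^2 is x^3 / 3.
    return x ** 3 / 3


# The iterative way: 1000 rectangles.
numeric = midpoint_integral(f, 0.0, 3.0, 1000)

# The calculus way: two evaluations of the antiderivative.
exact = antiderivative(3.0) - antiderivative(0.0)

print(numeric, exact)
```

The two numbers agree to several decimal places, but the second one costs two function evaluations instead of a thousand.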

So I wonder if we're simply in a pre-Calculus-like phase of NN/Perceptrons, where we haven't yet realized there's a mathematical way to "solve" a bunch of equations simultaneously and arrive at the best (or some local minimum) model weights for a given NN architecture and set of training data.

From a theoretical standpoint it IS a black box problem like this, where the set of training data goes in and an array of model weights comes out. If I were to guess, I'd bet there'll be some kind of "random seed" we can add as input, and for each seed we'll get a different local minimum/maximum for the model weights.

But I'm not a mathematician and there may be some sort of PROOF that what I just said can definitely never be done?

3 comments

NNs have complex non-convex loss functions that don't admit a closed-form solution. Even for small models, it can be shown that finding the optimal weights is an NP-complete problem. In fact, even for linear regression (least squares), which has a closed-form solution, it can be computationally cheaper to run gradient descent, since finding the closed-form solution requires you to calculate and invert a large matrix (X^T X).
Which in some sense is intuitive: any closed form that can model general computation to any significant degree should be hard: if it weren't, you could encode your NP-complete problem into it, solve it in an efficient closed form, and collect your Fields medal for proving P = NP.
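The least-squares point above can be sketched on a toy problem. This is only an illustration under stated assumptions: the data, learning rate, and step count are made up, and with two parameters the closed form is obviously cheap. The cost argument only bites when X^T X is large.

```python
# Hypothetical toy data: y is roughly 2*x + 1 with small fixed perturbations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]


def closed_form(xs, ys):
    """Solve the 2x2 normal equations (X^T X) theta = X^T y for slope and intercept."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    w = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b


def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimize mean squared error iteratively, never forming or inverting X^T X."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w_cf, b_cf = closed_form(xs, ys)
w_gd, b_gd = gradient_descent(xs, ys)
print(w_cf, b_cf)
print(w_gd, b_gd)
```

Both routes land on the same slope and intercept here because the least-squares loss is convex. That guarantee is exactly what is lost with the non-convex losses of neural networks.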
Intuition is often wrong, even for high IQ people, like your average HN user. lol.

For a long time it was intuitive that you cannot find the area under arbitrary functions, but then Calculus was invented, showing us a new "trick", that was previously unfathomable, and indistinguishable from magic.

I'm just not sure mankind's understanding of Mathematics is out of new "tricks" to be learned. I think there are types of algorithms today that look like they require N iterations to get X precision, when in reality we might be able to divide N by some factor, for some algorithms, and still end up with X precision.
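One existing illustration of "same precision, far fewer iterations" is root finding: bisection versus Newton's method. This is a sketch with a made-up target (the square root of 2) and tolerance, and it says nothing about whether a similar shortcut exists for neural network training.

```python
def bisection_sqrt2(tol=1e-12):
    """Halve the bracketing interval [lo, hi] until it is narrower than tol."""
    lo, hi = 1.0, 2.0
    steps = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid * mid < 2.0:
            lo = mid
        else:
            hi = mid
        steps += 1
    return (lo + hi) / 2, steps


def newton_sqrt2(tol=1e-12):
    """Newton's method on f(x) = x^2 - 2, using the derivative f'(x) = 2x."""
    x = 1.5
    steps = 0
    while abs(x * x - 2.0) > tol:
        x = x - (x * x - 2.0) / (2 * x)
        steps += 1
    return x, steps


root_b, steps_b = bisection_sqrt2()
root_n, steps_n = newton_sqrt2()
print(root_b, steps_b)
print(root_n, steps_n)
```

Newton's method gets there in a handful of steps because it uses derivative information, while bisection only gains one bit of precision per iteration.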

> I'm just not sure mankind's understanding of Mathematics is out of new "tricks" to be learned.

This is my opinion also as it relates to AI/ANN. Things I read about how scientists see the brain shifting due to learning (minimum energy of network type stuff) make it seem like the brain has some functions figured out that we haven't identified yet.

Maybe it's math that's already fully understood, just not applied well to ANNs, but maybe there's some secret sauce in there.

One reason to believe there's even new low hanging fruit (that doesn't even require new math) is how simple and trivial the "Attention Heads" structure of the Transformer architecture really is. It's not advanced at all. It was just a great idea that panned out, one that pretty much any creative AI researcher could've thought up after smokin' a joint. lol. I mean someone could do trivial experiments with different Perceptron network structuring and end up revolutionizing the world.

I think things are gonna get interesting real quick once LLMs themselves start "self experimenting" with writing code for different architectures.

Thanks for that great clarification. I had seen all those words before, but just not in that particular order. haha.

Maybe our only hope of doing LLM training runs in a tiny amount of time will be from Quantum Computing or even Photonic (wave-based) Computing.

There are actually neural networks with explicit optimization layers but I don’t think these have really had much success.
I just have a hunch we're still in the early days, even with Transformer architectures. The MLP (Perceptron) is such a simple mathematical structure, mostly doing linear stuff (tons of multiplications, then a few adds, and a squashing-type activation function), plus the attention heads add-on from the Transformers paper, of course (and other minor things). Ultimately it's a very easy to understand data structure, so it's hard for me to believe there aren't massive leaps and bounds that we can take to gain orders of magnitude more performance just like the leap that the Transformers paper had.
> We can take to gain orders of magnitude more performance just like the leap that the Transformers paper had.

Afaik the most important benefit of transformers isn't their “performance” (in the sense of ability to perform their tasks) but their scalability, which comes from their ability to be trained and evaluated efficiently on big GPU clusters, which isn't something you can do with recurrent neural networks.

And then, if I understood correctly, the benefit of state-space models is that you can train them in parallel and run them in a recurrent fashion, making inference cheaper than with transformers, especially as context size grows.
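A minimal sketch of that "train in parallel, run recurrently" idea, assuming the simplest possible case: a linear, time-invariant state-space model with a scalar state. The parameters `a`, `b`, `c` and the inputs are made up, and real state-space layers use vector states and more machinery than this.

```python
def run_recurrent(a, b, c, xs):
    """Inference-style: process inputs one at a time, carrying a single hidden state."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state update
        ys.append(c * h)   # readout
    return ys


def run_convolution(a, b, c, xs):
    """Training-style: each output is an independent sum over a precomputed kernel."""
    kernel = [c * (a ** k) * b for k in range(len(xs))]
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


# Hypothetical parameters and input sequence.
a, b, c = 0.9, 0.5, 2.0
xs = [1.0, 0.0, -1.0, 2.0]

ys_recurrent = run_recurrent(a, b, c, xs)
ys_convolution = run_convolution(a, b, c, xs)
print(ys_recurrent)
print(ys_convolution)
```

The two functions produce the same outputs. In the convolution form no output depends on a previous output, so all time steps can be computed at once, while the recurrent form only needs a constant-size state per step no matter how long the context is.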

The biggest thing I had understood about the Transformers Paper (Attention is all you Need) is how the "attention heads" vectors are wired up in such a way as to allow words to be "understood" in the proper context. In other words, "see spot run" and "run a computer program" give the word "run" dramatically different but specific contexts.

It was also my understanding that without those attention heads, even scaling up to the parameter sizes we have today would not have ended up with the level of emergent intelligence that shocked the world with GPT 3.5. We needed both very large models and words put into semantic context in semantic space.
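The "same word, different context" idea can be sketched as single-head scaled dot-product self-attention. This is a toy under loud assumptions: the 2-d embeddings are invented, and there are no learned query/key/value projections (Q = K = V = the raw vectors), which a real Transformer would have.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with Q = K = V = vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted mix of every vector in the sentence.
        out.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        )
    return out


# Hypothetical embeddings: the identical "run" vector sits in two sentences.
run = [1.0, 1.0]
sentence_a = [[2.0, 0.0], [1.0, 0.0], run]  # stand-in for "see spot run"
sentence_b = [run, [0.0, 1.0], [0.0, 2.0]]  # stand-in for "run a ... program"

run_in_a = self_attention(sentence_a)[2]
run_in_b = self_attention(sentence_b)[0]
print(run_in_a)
print(run_in_b)
```

The input vector for "run" is identical in both sentences, but its output differs because it is mixed with its neighbours: pulled toward the first axis in one sentence and toward the second axis in the other.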

Attention heads existed before Transformers; they were used in recurrent neural networks (RNNs) to improve their performance. The paper is called “Attention is all you need” because transformers keep the attention heads while discarding the RNN part entirely.

Getting rid of the RNN vastly improved training scalability and allowed big players to start training enormous models on even more enormous training sets in ways that weren't possible with an RNN AFAIK.

When discussing "Attention Heads" in the context of the Transformers Paper, there's no need to put the word "Self" in front of it, as in "Self-Attention". That's the context in which I used the word Attention above. Something similar to self-attention had pre-existed this paper, but not actual self-attention.

You're right that getting rid of "Recurrence" was another innovation, but removing it was probably more of a hack to make things parallelizable, than something that was architecturally justifiable from first principles (like self-attention is), because there's definite "power" in Recurrence (making it desirable), but it's just too costly to run that in LLMs because of CPU cycles.

There's a whole lotta certainty about even intractable integrals which is lacking in the case of neural nets grappling with noisy incomplete real world data.
There's at least 100 different equally likely interpretations of that particular sequence of words you just wrote.