by quantadev 618 days ago
Most LLMs aren't even using a "curve" yet at all, right? All they're using is a series of linear equations because the model weights are a simple multiply and add (i.e. basic NN Perceptron). Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.

I think future NNs will probably be more adaptive than this, where some Perceptrons use sine wave functions, or other kinds of math functions, beyond just linear "y=mx+b".

It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".


The "squashing function" is necessarily nonlinear in multilayer neural networks. A single layer of a neural network can be written quite simply as a weight matrix times an input vector, equalling an output vector, like so

Ax = y

Adding another layer is just multiplying a different set of weights times the output of the first, so

B(Ax)= y

If you remember your linear algebra course, you might see the problem: that can be simplified

(BA)x = y

Cx = y

Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.

To prevent this collapse, a non linear function must be introduced between each layer.
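
A quick way to check the collapse numerically is the sketch below. The matrices `A` and `B`, the input `x`, and the helper names are made up for illustration, and ReLU is just one possible choice of nonlinearity.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    """Multiply matrix P by matrix Q (both lists of rows)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def relu(v):
    """Elementwise ReLU, used here as an example nonlinearity."""
    return [max(0.0, x) for x in v]

# Illustrative weights and input (not from any real model).
A = [[1.0, -2.0], [3.0, 0.5]]
B = [[0.5, 1.0], [-1.0, 2.0]]
x = [1.0, 2.0]

two_layers = matvec(B, matvec(A, x))       # B(Ax)
C = matmul(B, A)                           # C = BA
one_layer = matvec(C, x)                   # Cx
with_relu = matvec(B, relu(matvec(A, x)))  # B relu(Ax)

print(two_layers, one_layer, with_relu)
```

The first two results agree, because the two linear layers are the single layer C. With the ReLU in between, the result differs and can no longer be written as one fixed matrix times x.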

Right. All the squashing is doing is keeping the output of any neuron in a range below 1.

But the entire NN itself (Perceptron ones, which most LLMs are) is still completely using nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y=m*x+b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.

The non linearities are fundamental. Without them, any arbitrarily deep NN is equivalent to a shallow NN (easily computable, as GP was saying), and we know those can't even solve the XOR problem.
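
For anyone who wants to see the XOR limitation concretely, here is a rough sketch. For any unit f(x, y) = w1*x + w2*y + b, f(0,0) + f(1,1) always equals f(0,1) + f(1,0), so no threshold on f can put (0,1) and (1,0) on one side and (0,0) and (1,1) on the other. The brute-force search below just illustrates that; the grid of weights and the helper name `count_separators` are arbitrary choices for the demo.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def count_separators(target, grid):
    """Count (w1, w2, b) on the grid where thresholding w1*x + w2*y + b at 0
    reproduces target on all four binary inputs."""
    hits = 0
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x + w2 * y + b > 0) == bool(target(x, y)) for x, y in POINTS):
            hits += 1
    return hits

grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
and_hits = count_separators(lambda x, y: x & y, grid)
xor_hits = count_separators(lambda x, y: x ^ y, grid)
print("AND separators found:", and_hits)
print("XOR separators found:", xor_hits)
```

AND is linearly separable, so the search finds weights for it (for example w1 = w2 = 1, b = -1.5), while it finds none for XOR, and a finer grid would not change that.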

> nothing but linearity

No, if you have non linearities, the NN itself is not linear. The non linearities are not there primarily to keep the outputs in a given range, though that's important, too.

Nonlinearity somewhere is fundamental, but it doesn't need to be between each layer. You can, for instance, project each input to a higher dimensional space with a nonlinearity, and the problem becomes linearly separable with high probability (cf Cover's Theorem).

So, for XOR, (x, y) -> (x, y, xy), and it becomes trivial for a linear NN to solve.
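
As a small sketch of that lifting trick: once the product term xy is available as a third feature, the weights (1, 1, -2) reproduce XOR exactly with a purely linear map. The names `lift` and `linear` are just for this illustration.

```python
def lift(x, y):
    """Nonlinear feature map (x, y) -> (x, y, xy)."""
    return (x, y, x * y)

# A purely linear readout on the lifted features: x + y - 2xy.
WEIGHTS = (1, 1, -2)

def linear(features):
    return sum(w * f for w, f in zip(WEIGHTS, features))

results = {(x, y): linear(lift(x, y)) for x in (0, 1) for y in (0, 1)}
print(results)
```

The only nonlinearity is in the fixed feature map, and everything after it is a single linear layer.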

Architectures like Mamba have a linear recurrent state space system as their core, so even though you need a nonlinearity somewhere, it doesn't need to be pervasive. And linear recurrent networks are surprisingly powerful (https://arxiv.org/abs/2303.06349, https://arxiv.org/abs/1802.03308).

> The non linearities are not there primarily to keep the outputs in a given range

Precisely what the `Activation Function` does is to squash an output into a range (normally below one, like tanh). That's the only non-linearity I'm aware of. What other non-linearities are there?

All the training does is adjust linear weights tho, like I said. All the training is doing is adjusting the slopes of lines.

> That's the only non-linearity I'm aware of.

"only" is doing a lot of work here because that non-linearity is enough to vastly expand the landscape of functions that an NN can approximate. If the NN was linear, you could greatly simplify the computational needs of the whole thing (as was implied by another commenter above) but you'd also not get a GPT out of it.

All the trainable parameters are just slopes of lines tho. Training NNs doesn't involve adjusting any inputs to non-linear functions. The tanh smashing function just makes sure nothing can blow up into large numbers and all outputs are in a range of less than 1. There's no "magic" or "knowledge" in the tanh smashing. All the magic is 100% in the weights. They're all linear. The amazing thing is that all weights are linear slopes of lines.

With a ReLU activation function, rather than a simple linear function of the inputs, you get a piecewise linear approximation of a nonlinear function.

ReLU enables this by being nonlinear in a simple way, specifically by outputting zero for negative inputs, so each linear unit can then limit its contribution to a portion of the output curve.

(This is a lot easier to see on a whiteboard!)
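
In lieu of a whiteboard, here is a minimal sketch of the idea. It builds a sum of ReLU units by hand (no training) that interpolates x^2 on [0, 1]. Each unit switches on at its own knot and adds a change of slope from there onward. The target function, knot positions, and helper names are all assumptions for the demo.

```python
def relu(z):
    return max(0.0, z)

def fit_relu_sum(f, knots):
    """Build bias + sum(coef * relu(x - knot)) that matches f at every knot."""
    slopes = [(f(b) - f(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    # First unit carries the first slope; each later unit adds the slope change.
    coefs = [slopes[0]] + [s1 - s0 for s0, s1 in zip(slopes, slopes[1:])]
    return f(knots[0]), list(zip(coefs, knots[:-1]))

def evaluate(model, x):
    bias, units = model
    return bias + sum(c * relu(x - k) for c, k in units)

def target(x):
    return x * x

def max_error(model, samples=1000):
    """Largest absolute error over evenly spaced sample points in [0, 1]."""
    xs = [i / samples for i in range(samples + 1)]
    return max(abs(evaluate(model, x) - target(x)) for x in xs)

coarse = fit_relu_sum(target, [i / 4 for i in range(5)])    # 4 ReLU units
fine = fit_relu_sum(target, [i / 32 for i in range(33)])    # 32 ReLU units

print("max error, 4 units: ", max_error(coarse))
print("max error, 32 units:", max_error(fine))
```

More units means more, shorter line segments, so the piecewise linear curve tracks the parabola more closely. A trained network finds its own knots and slopes rather than having them placed by hand like this.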

ReLU technically has a non-linearity at zero, but in some sense it's still even MORE linear than tanh or sigmoid, so it just demonstrates even better than tanh-type squashing that all this LLM stuff is being done ultimately with straight line math. All a ReLU function does is choose which line to use, a sloped one or a zero one.

> squash an output into a range

This isn't the primary purpose of the activation function, and in fact it's not even necessary. For example see ReLU (probably the most common activation function), leaky ReLU, or for a sillier example: https://youtu.be/Ae9EKCyI1xU?si=KgjhMrOsFEVo2yCe
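
To make the "not even necessary" point concrete, here is a small comparison. The leaky ReLU slope of 0.01 is a commonly used value but is an assumption here, not something from the thread.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, negative_slope=0.01):
    """Leaky ReLU; the 0.01 slope is an illustrative default."""
    return z if z > 0 else negative_slope * z

for z in (-10.0, 0.5, 1000.0):
    print(z, math.tanh(z), relu(z), leaky_relu(z))
```

tanh stays within [-1, 1], while ReLU and leaky ReLU pass large positive inputs through unchanged. They bound nothing on the positive side and still work as activation functions, because what matters is that they are nonlinear.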

You can change the subject by bringing up as many different NN architectures, Activation Functions, etc. as you want. I'm telling you the basic NN Perceptron design (what everyone means when they refer to Perceptrons in general), has something like a `tanh` and not only is its PRIMARY function to squash a number, that's its ONLY function.

> It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".

In Ye Olden days (the 90’s) we used to approximate non-linear models using splines or separate slopes models - fit by hand. They were still linear, but with the right choice of splines you could approximate a non-linear model to whatever degree of accuracy you wanted.

Neural networks “just” do this automatically, and faster.

In college (BSME) I wrote a computer program to generate cam profiles from Bezier curves. It's just a programming trick to generate curves from straight lines at any level of accuracy you want just by letting the computer take smaller and smaller steps.
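
One common way to do this is de Casteljau's algorithm, which evaluates a Bezier curve using nothing but repeated straight-line interpolation. The sketch below assumes that approach (the original program may have worked differently), and the control points are made up.

```python
def lerp(p, q, t):
    """Straight-line interpolation between points p and q."""
    return tuple((1 - t) * a + t * b for a, b in zip(p, q))

def bezier_point(control_points, t):
    """De Casteljau: keep interpolating adjacent points until one remains."""
    pts = list(control_points)
    while len(pts) > 1:
        pts = [lerp(p, q, t) for p, q in zip(pts, pts[1:])]
    return pts[0]

def bezier_polyline(control_points, steps):
    """Approximate the curve with `steps` straight segments."""
    return [bezier_point(control_points, i / steps) for i in range(steps + 1)]

control = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]  # illustrative control points
polyline = bezier_polyline(control, 8)
for point in polyline:
    print(point)
```

Raising `steps` gives shorter segments and a smoother-looking curve. Note that the curve starts and ends at the first and last control points but only bends toward the middle one without passing through it.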

It's an interesting concept to think of how NNs might be able to exploit this effect in some way based on straight lines in the weights, because a very small number of points can identify a very precise and smooth curve, where directions on the curve might equate to Semantic Space Vectors.

In fact now that I think about it, for any 3 or more points in Semantic Space, there would necessarily be a "Bezier Path" which would have genuine meaning at every point as a good smooth differentiable path thru higher dimensional space to get from one point to another point while "visiting" all intermediate other points. This has to have a direct use in LLMs in terms of reasoning.