Hacker News
by Paul-Craft 1101 days ago
This seems like just another way of saying that when you train an LLM on a text, its weights incorporate the tokens in that text, which is nothing really profound.

I think the real magic here comes from the fact that LLMs are a specialized sort of neural network, and that neural networks are universal approximators [0]. In other words, LLMs are general learners because they are neural networks.

This is also not particularly profound, except that there are mathematical proofs of the universal approximation theorem that give us insight into why it must be so.
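For reference, one common single-hidden-layer, arbitrary-width form of the theorem can be stated roughly as below. The symbols ($\sigma$, $K$, $N$, $a_i$, $w_i$, $b_i$) are notation chosen here for illustration, not taken from the linked article.

```latex
Let $\sigma:\mathbb{R}\to\mathbb{R}$ be continuous and not a polynomial, and let
$K\subset\mathbb{R}^n$ be compact. Then for every continuous $f:K\to\mathbb{R}$
and every $\varepsilon>0$ there exist $N\in\mathbb{N}$, $a_i, b_i\in\mathbb{R}$
and $w_i\in\mathbb{R}^n$ such that
\[
  \sup_{x\in K}\,\Bigl|\, f(x)-\sum_{i=1}^{N} a_i\,\sigma\bigl(w_i^{\top}x+b_i\bigr) \Bigr| < \varepsilon .
\]
```

Note that the width $N$ depends on $f$ and $\varepsilon$ and is not bounded in advance, and that the statement is about existence only: it says nothing about whether training will find those weights.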

---

[0]: https://en.wikipedia.org/wiki/Universal_approximation_theore...

2 comments

The ingredients you need for training a useful machine learning model are expressivity, learnability, and generalization. Many methods are universal approximators but that only takes care of the first ingredient. Arguably the reason neural networks are so successful is that they can offer a good balance between the three.

Before transformers we built different neural network architectures for each domain. These architectures offered better inductive biases for their respective domains and thus traded off some of the expressivity for better learnability and generalization.

Nowadays the best architectures seem to be converging on transformers. They appear to offer more generally useful inductive biases and thus a better trade-off between the three ingredients than the earlier architectures.

A lot of universal approximators are piss poor at general learning. It's taken a lot of hard work and clever people to get LLMs to where they are. It's not as simple as neural network and done.
Current LLMs are also piss-poor general learners; they are, however, really good at learning specific things that people value highly.
Some 15 years ago, textbooks taught that multilayer perceptrons (fully connected feed-forward networks) with one hidden layer were sufficient because they were universal approximators. That thought kinda held back the field for a long time. Going against that dogma was so revolutionary that the new paradigm was given its own name: deep learning.

Just because you can find some gotcha counterexample that LLMs struggle with doesn't invalidate that we've come a very long way.

I think that was largely a misunderstanding. 20+ years ago I took an AI class that mentioned that using multiple layers was useful for training neural networks. It also mentioned that a 2-layer network was only a universal approximator given an arbitrarily large number of nodes, which again seems to be forgotten about.
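The "arbitrarily large number of nodes" caveat can be made concrete with a small sketch. It builds a one-hidden-layer ReLU network by hand (no training) that linearly interpolates a target on [0, 1], and measures how the worst-case error depends on the width. The function names and the sine target are assumptions picked for illustration, and hand-placing the weights is a stand-in for whatever a learning algorithm might find.

```python
import math

def relu(z):
    return z if z > 0.0 else 0.0

def build_one_hidden_layer(f, width):
    """Return (bias, knots, coeffs) for a width-unit ReLU net interpolating f on [0, 1]."""
    h = 1.0 / width
    knots = [i * h for i in range(width)]          # one threshold per hidden unit
    values = [f(i * h) for i in range(width + 1)]  # target values at the grid points
    slopes = [(values[i + 1] - values[i]) / h for i in range(width)]
    # Each hidden unit contributes the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, width)]
    return values[0], knots, coeffs

def predict(net, x):
    bias, knots, coeffs = net
    return bias + sum(c * relu(x - k) for k, c in zip(knots, coeffs))

def max_error(f, width, samples=2001):
    """Worst absolute error over an evenly spaced sample of [0, 1]."""
    net = build_one_hidden_layer(f, width)
    grid = [j / (samples - 1) for j in range(samples)]
    return max(abs(f(x) - predict(net, x)) for x in grid)

def target(x):
    # Hypothetical target function, chosen only for illustration.
    return math.sin(2.0 * math.pi * x)

if __name__ == "__main__":
    for width in (4, 16, 64, 256):
        print(width, max_error(target, width))
```

The error shrinks as the hidden layer gets wider, but any fixed width leaves a gap, which is the point being made: one hidden layer is "enough" only if you are allowed as many nodes as the target demands.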

Though the teacher had worked in industry for a while, which may have been relevant, as we didn’t focus that much on theory.

PS: Deep learning was also more about improving computational power than some major theoretical advancement.

Nah, I keep hearing this. I was doing multilayer networks in the '90s; the problem was that my machine didn’t even have a floating-point unit, so I had to hand-roll my own fixed-point math, and the CPU ran at about 100 MHz.
What's the destination? We've come a long way, so where do you think we're going?
Correction: hard work, clever people, and massive increases in computational power. I'm sure all three matter quite a lot here.

I'm not saying that if your goal is to come up with a usable general learning algorithm that it is just "as simple as neural network and done." What I'm saying is the converse: that the general learning capabilities of LLMs are most likely explained by the fact that, well, they are general learners, via the universal approximation theorem.

Your other comment, I think, suggests why we're just now starting to see more general learning capabilities out of neural networks, when the theory says that a single hidden layer is enough: with a single hidden layer, you really need to get all the weights pretty close to "right" to see general learning/universal approximator behavior. When you have more than one hidden layer, then some of your weights can be wrong, as long as the errors are corrected in later layers.

Now, I'm not an AI researcher or even anyone who works anywhere near this area, but I did take a course or two in grad school, and this seems at least intuitively plausible to me. If there are researchers in the field reading this, I'd definitely like to hear their takes, because I'm totally open to being completely wrong here. I'd rather be one of the lucky 10,000 than just have this half-baked idea that seems right. :-)

Hardware matters most. No matter how clever you are, there’s no storing such large parameter sets on an Intel 286 with 4MB RAM.

No matter how clever the programmer, there’s no encoding GPT-4 with that. It was the hardware constraints that required programmers to be clever to begin with. These days it’s much more “copy paste the math directly, because our data set is so robust and our hardware and networks so performant that clever low-level hacks don’t matter.”
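For a rough sense of scale on the 4MB figure, here is a back-of-the-envelope sketch. It reads 4MB as 4 MiB, ignores the operating system and the program itself, and uses a made-up one-billion-parameter model purely for comparison; that count is an illustrative assumption, not a claim about any real model.

```python
def max_params(memory_bytes, bytes_per_param):
    """Largest whole number of parameters that fit in the given memory."""
    return memory_bytes // bytes_per_param

def memory_needed(num_params, bytes_per_param):
    """Bytes required to store num_params parameters at the given precision."""
    return num_params * bytes_per_param

RAM_286 = 4 * 2**20  # the 4MB from the comment, read as 4 MiB

if __name__ == "__main__":
    # How many parameters fit at common storage widths.
    for name, size in (("float32", 4), ("float16", 2), ("int8", 1)):
        print(name, max_params(RAM_286, size))

    # Hypothetical model size, for illustration only.
    hypothetical_params = 10**9
    ratio = memory_needed(hypothetical_params, 2) / RAM_286
    print("multiples of 4 MiB needed at 2 bytes per parameter:", ratio)
```

Even at one byte per parameter, 4 MiB holds only a few million weights, so the gap to a billion-parameter model is several hundredfold before any working memory is counted.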

Especially at big tech, where they’ve used their own AI to guide them. The ability to just ask an ML system to simplify math has existed for a few years now, and we’ve all seen how clever outputs were set aside for safe linear hacking.

Truly clever work is occurring in more traditional sciences like chemistry and biology these days.