Hacker News
by Chabsff 213 days ago
Yeah, but that's their interface. That informs surprisingly little about their inner workings.

ANNs are arbitrary function approximators. The training process uses statistical methods to identify a set of parameters that approximate the function as well as possible. That doesn't necessarily mean that the end result is equivalent to a very fancy multi-stage linear regression. It's a possible outcome of the process, but it's not the only possible outcome.

Looking at an LLM's I/O structure and training process is not enough to conclude much of anything. And that's the misconception.


> Yeah, but that's their interface. That informs surprisingly little about their inner workings.

I'm not sure I follow. LLMs are probabilistic next-token prediction based on current context; that is a factual, foundational statement about the technology that runs all LLMs today.
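To make "probabilistic next-token prediction" concrete, here is a minimal sketch of the sampling step only. The vocabulary and the scores are made up for illustration; in a real LLM the scores would come from the network evaluated on the current context, which this toy does not model at all.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, vocab, rng):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical three-word vocabulary and hypothetical scores.
vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
rng = random.Random(0)  # seeded so the draw is repeatable
token = sample_next_token(logits, vocab, rng)
print(probs, token)
```

Higher-scoring tokens are drawn more often, but any token with non-zero probability can come out, which is where the "probabilistic" part comes from.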

We can ascribe other things to that, such as reasoning or knowledge or agency, but that doesn't change how they work. Their fundamental architecture is well understood, even if we allow for the idea that maybe there are some emergent behaviors that we haven't described completely.

> It's a possible outcome of the process, but it's not the only possible outcome.

Again, you can ascribe these other things to it, but to say that these external descriptions of outputs call into question the architecture that runs these LLMs is a strange thing to say.

> Looking at an LLM's I/O structure and training process is not enough to conclude much of anything. And that's the misconception.

I don't see how that's a misconception. We evaluate pretty much everything by inputs and outputs. And we use those to infer internal state. Because that's all we're capable of in the real world.

Then why not say "they are just computer programs"?

I think the reason people don't say that is because they want to say "I already understand what they are, and I'm not impressed and it's nothing new". But what the comment you are replying to is saying is that the inner workings are the important innovative stuff.

> Then why not say "they are just computer programs"?

LLMs are probabilistic or non-deterministic computer programs; plenty of people say this. That is not much different from saying "LLMs are probabilistic next-token prediction based on current context".

> I think the reason people don't say that is because they want to say "I already understand what they are, and I'm not impressed and it's nothing new". But what the comment you are replying to is saying is that the inner workings are the important innovative stuff.

But we already know the inner workings. It's transformers, embeddings, and math at a scale that we couldn't do before 2015. We already had multi-layer perceptrons with backpropagation and recurrent neural networks and Markov chains before this, but the hardware to do this kind of contextual next-token prediction simply didn't exist at those times.

I understand that it feels like there's a lot going on with these chatbots, but half of the illusion of chatbots isn't even the LLM, it's the context management that is exceptionally mundane compared to the LLM itself. These things are combined with a carefully crafted UX to deliberately convey the impression that you're talking to a human. But in the end, it is just a program and it's just doing context management and token prediction that happens to align (most of the time) with human expectations because it was designed to do so.
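As a rough illustration of how mundane the context-management half can be, here is a sketch of a chat turn. Everything in it is an assumption for the sake of the example: `fake_model` is a stand-in for the LLM call, "tokens" are just whitespace-separated words, and the budget of 12 is arbitrary. Real products are more elaborate, but the shape is similar.

```python
def build_context(history, max_tokens):
    """Keep the most recent messages that fit in the token budget.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    kept = []
    used = 0
    for message in reversed(history):  # walk from newest to oldest
        cost = len(message.split())
        if used + cost > max_tokens:
            break  # older messages are dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

def fake_model(context):
    """Hypothetical stand-in for the LLM: reports how much context it saw."""
    return f"(reply based on {len(context)} messages)"

def chat_turn(history, user_message, max_tokens=12):
    """Append the user message, trim the context, call the model, store the reply."""
    history.append("user: " + user_message)
    context = build_context(history, max_tokens)
    reply = fake_model(context)
    history.append("assistant: " + reply)
    return reply

history = []
first_reply = chat_turn(history, "hello there")
print(first_reply)
```

The "memory" of the conversation lives entirely in the `history` list that gets trimmed and fed back in on every turn.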

The two of you seem to be implying there's something spooky or mysterious happening with LLMs that goes beyond our comprehension of them, but I'm not seeing the components of your argument for this.

> But we already know the inner workings.

Overconfident and wrong.

No one understands how an LLM works. Some people just delude themselves into thinking that they do.

Saying "I know how LLMs work because I read a paper about transformer architecture" is about as delusional as saying "I read a paper about transistors, and now I understand how Ryzen 9800X3D works". Maybe more so.

It takes actual reverse engineering work to figure out how LLMs can do small bits and tiny slivers of what they do. And here you are - claiming that we actually already know everything there is to know about them.

I never claimed we already know everything about LLMs. Knowing "everything about" anything these days is impossible given the complexity of our technology. Even antennas, a technology more than a century old, are something we're still innovating on and don't completely understand in all domains.

But that's a categorically different statement than "no one understands how an LLM works", because we absolutely do.

You're spending a lot of time describing whether we know or don't know LLMs, but you're not talking at all about what it is that you think we do or do not understand. Instead of describing what you think the state of the knowledge is about LLMs, can you talk about what it is that you think is unknown or not understood?

I think the person you are responding to is using a strange definition of "know."

I think they mean "do we understand how they process information to produce their outputs" (i.e., do we have an analytical description of the function they are trying to approximate).

You and I mean that we understand the training process that produces their behaviour (and this training process is mainly standard statistical modelling / ML).

In short, both sides are talking past each other.

> Saying "I know how LLMs work because I read a paper about transformer architecture" is about as delusional as saying "I read a paper about transistors, and now I understand how Ryzen 9800X3D works". Maybe more so.

Which is to say, not delusional at all.

Or else we have to accept that basically hardly anyone "understands" anything. You set an unrealistic standard.

Beginners play abstract board games terribly. We don't say that this means they "don't understand" the game until they become experts; nor do we say that the experts "haven't understood" the game because it isn't strongly solved. Knowing the rules, consistently making legal moves and perhaps having some basic tactical ideas is generally considered sufficient.

Similarly, people who took the SICP course and didn't emerge thoroughly confused can reasonably be said to "understand how to program". They don't have to create MLOC-sized systems to prove it.

> It takes actual reverse engineering work to figure out how LLMs can do small bits and tiny slivers of what they do. And here you are - claiming that we actually already know everything there is to know about them.

No; it's a dismissal of the relevance of doing more detailed analysis, specifically to the question of what "understanding" entails.

The fact that a large pile of "transformers" is capable of producing the results we see now may be surprising; and we may lack the mental resources needed to trace through a given calculation and ascribe aspects of the result to specific outputs from specific parts of the computation. But that just means it's a massive computation. It doesn't fundamentally change how that computation works, and doesn't negate the "understanding" thereof.

Understanding a transistor is an incredibly small part of how Ryzen 9800X3D does what it does.

Is it a foundational part? Yes. But if you have it and nothing else, that adds up to knowing almost nothing about how the whole CPU works. And you could come to understand much more than that without ever learning what a "transistor" even is.

Understanding low level foundations does not automatically confer the understanding of high level behaviors! I wish I could make THAT into a nail, and drive it into people's skulls, because I keep seeing people who INSIST on making this mistake over and over and over and over and over again.

What do you mean? What do you think statistical modelling is?

I am very confused by your stance.

The aim of the function approximation is to maximize the likelihood of the observed data (this is standard statistical modelling). Using machine learning (e.g., stochastic gradient descent) on a class of universal function approximators is a standard approach to fitting such a model.
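For a toy version of what I mean, here is maximum likelihood fitted by stochastic gradient descent. The "model" is a single Bernoulli parameter rather than a universal function approximator, and the data, learning rate, and epoch count are all made-up assumptions, but the recipe (minimize negative log-likelihood one sample at a time) is the same idea.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_bernoulli_sgd(data, lr=0.01, epochs=300, seed=0):
    """Fit P(y = 1) by SGD on the negative log-likelihood.

    The model is p = sigmoid(theta). For one observation y, the gradient
    of the negative log-likelihood with respect to theta is (p - y).
    """
    rng = random.Random(seed)
    theta = 0.0  # start at p = 0.5
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the data in a random order each epoch
        for y in samples:
            grad = sigmoid(theta) - y
            theta -= lr * grad  # one stochastic gradient step
    return sigmoid(theta)

# Hypothetical data: 70 ones and 30 zeros.
data = [1] * 70 + [0] * 30
p_hat = fit_bernoulli_sgd(data)
closed_form = sum(data) / len(data)  # the exact maximum-likelihood estimate
print(p_hat, closed_form)
```

Here the exact maximum-likelihood answer is just the observed frequency, so you can check that SGD lands close to it. With a neural network there is no closed form to compare against, but the fitting objective is the same kind of thing.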

What do you think statistical modelling involves?