Hacker News
by eggdaft 766 days ago
I think what this argument is missing is the emergent properties of the LLM.

In order to “predict the next word”, the LLM doesn’t just learn the most likely word from a corpus for the preceding string. If that were true, it would not generalise outside of its training set.
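To make the contrast concrete, here is a minimal sketch of the "most likely word from a corpus" approach, assuming a plain bigram frequency table. The corpus and the function names are made up for illustration. It answers for words it has seen and has nothing to say about words it has not:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words followed it in the corpus."""
    table = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table

def predict_next(table, prev):
    """Return the most frequent follower of prev, or None if prev was never seen."""
    followers = table.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus, invented for this example.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
table = train_bigram(corpus)

seen = predict_next(table, "the")       # "the" appears in the corpus
unseen = predict_next(table, "kitten")  # "kitten" never appears
```

A pure lookup like this cannot say anything about "kitten", which is the sense in which it fails to generalise outside its training set.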

The LLM learns about the structure of the language, the context, and in the process of doing so constructs a model of the world as represented by words.

Admittedly the model is still limited, but it seems to me that there is something more insightful to be gleaned here: given enough data and sufficient pressure to learn, excelling at scale on a relatively simple task leads indirectly to a form of intelligence.

For me the biggest takeaway of LLMs might be that “intelligence is pretty cheap, actually” and that the human brain is not so remarkable as we’d like to believe.

3 comments

Each word is mapped to a distribution over words; this is where the illusion of "context" largely comes from. E.g., "cat" is replaced by a weighted set (cat, kitten, pet, mammal, ...) which is obtained via frequencies in a historical dataset.

So technically the LLM is not doing P(next word | previous word) -- but rather, P(associated_words(next word) | associated_words(previous), associated_words(previous_-1), ...).
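A toy sketch of the picture this comment describes, not of how a real LLM is implemented: each conditioning word is expanded into a weighted set of associated words, and the follower counts of those words are mixed. The association weights, the corpus, and the function names are all invented for illustration, and only one previous word is used to keep it short.

```python
from collections import Counter, defaultdict

# Hypothetical association weights, made up for this example.
ASSOCIATED = {
    "cat": {"cat": 0.6, "kitten": 0.3, "pet": 0.1},
    "kitten": {"kitten": 0.6, "cat": 0.3, "pet": 0.1},
}

def follower_counts(corpus):
    """Count, for each word, which words followed it in the corpus."""
    table = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table

def next_word_distribution(table, prev):
    """Mix the follower counts of every word associated with prev."""
    scores = Counter()
    # Words without an association entry fall back to themselves.
    for word, weight in ASSOCIATED.get(prev, {prev: 1.0}).items():
        for nxt, count in table.get(word, {}).items():
            scores[nxt] += weight * count
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total else {}

# Toy corpus in which "kitten" never appears.
corpus = ["the cat purrs", "the cat purrs loudly", "the cat sleeps"]
table = follower_counts(corpus)
dist = next_word_distribution(table, "kitten")
best = max(dist, key=dist.get)
```

Here "kitten" borrows the followers of "cat" through the association table, so it gets a prediction even though it never occurs in the corpus. Whether this is a fair description of what a transformer does is exactly what the replies dispute.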

This means its search space for each conditional step is still extremely large in the historical corpus, and there's more flexibility to reach "across and between contexts" -- but it isn't sensitive to context; we just arranged the data that way.

Soon enough people with enough money will build diagnostic (XAI) models of LLMs that are powerful enough to show this process at work over its training data.

To visualize roughly, imagine you're in a library and you're asked a question. The first word selects a very large number of pages across many books (and whole books); the second word selects both other books and pages across the books you have. Keep going: for each additional word you're asked, you convert it to a set of words, find more pages and books, and also get narrower paragraph samples from the ones you have. Finally, with the total set of pages, paragraphs, etc. that you have to hand at the end of the question, you find the word most likely to follow the others.

This process will eventually be visualised properly, with a real-world LLM, but it'll take a significant investment to build this sort of explanatory model, since you need to reverse from weights to training data across the entire inference process.

The context comes from the attention mechanism, not from word embeddings.
Run attention on an ordinal word embedding and see what happens
Well yes, necessary but not sufficient, obviously.
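For anyone who wants to try the experiment, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes identity query/key/value projections and hand-made embeddings, so it is far simpler than a real transformer (no learned weights, heads, or positional information). All names and numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    return softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])

def attend(query, vectors):
    """Weighted average of the vectors (identity Q/K/V projections assumed)."""
    weights = attention_weights(query, vectors)
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]

# Hypothetical 2-d embeddings: axis 0 is roughly "water", axis 1 roughly "finance".
EMB = {"river": [1.0, 0.0], "money": [0.0, 1.0], "bank": [0.5, 0.5]}

# The same word ends up with a different vector depending on its neighbour.
bank_near_river = attend(EMB["bank"], [EMB["river"], EMB["bank"]])
bank_near_money = attend(EMB["bank"], [EMB["money"], EMB["bank"]])

# Ordinal "embedding": each word is just its arbitrary vocabulary index.
ORDINAL = {"river": [1.0], "bank": [2.0], "money": [3.0]}
ordinal_weights = attention_weights(
    ORDINAL["bank"], [ORDINAL["river"], ORDINAL["bank"], ORDINAL["money"]]
)
```

With the hand-made embeddings, "bank" is pulled toward the "water" axis next to "river" and toward the "finance" axis next to "money". With the ordinal embedding, most of the weight goes to whichever word happens to have the largest index, which reflects the numbering rather than the meaning. In this toy, at least, attention only produces useful context when the embeddings it operates on already carry some structure.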
> and that the human brain is not so remarkable as we’d like to believe.

Well, it IS pretty seamlessly integrated with a very impressive suite of sensors.

Yes, our human sensor fusion is remarkable. The input signal of, say, our eyes is warped, upside down, and low resolution apart from a tiny patch that races across the field of vision to capture high-resolution samples (saccades). Yet, to us, it feels seamless and encompassing.
Bingo.

When I write some 100% bespoke code that is rather hastily composed and then paste it all into ChatGPT4 asking it to "refactor this code with a focus on testability and maintainability", and not only does it do so but it does a pretty damn good job of it, it feels rather reductive to say "it's just providing the next most likely word".

I mean, maybe that's how it works, but that statistical output clearly involves modeling what my code does and what I want it to do. Rather than make me think LLMs are a cheap trick, it just has me thinking, "shit - maybe that's all I do too."

Averaged faces are beautiful, averaged code is clean. Not sure how that is hard to believe. Just don't extrapolate it too far or it will get strange.