by pgspaintbrush 1097 days ago
Author here. First off, thank you for reading and for your thoughts. I provided examples that I thought would be intuitive for humans, to help folks see that an understanding of the underlying phenomena is useful for next-token prediction (I've added this as a note). Could you share what part of the article came across as suggesting that LLMs "magically" acquire whatever ability helps them to predict? I'd like to make that section clearer, so that it doesn't come across that way.

Re: "LLMs are not particularly good at arithmetic". There are published results that show that LLMs using certain techniques reach close to 100% accuracy on 8-digit addition: https://arxiv.org/pdf/2206.07682.pdf. There are also recent results from OpenAI where their model obtained solid results on high school math competition problems, which are harder than arithmetic: https://openai.com/research/improving-mathematical-reasoning... I haven't looked into counting syllables or recognizing haikus but I bet that this is a result of tokenization and not an inability of the model to create a representation of the underlying phenomena.


Thanks for responding to my comment.

I'm not an expert in the field, but there are lots of earlier algorithms for predicting the next token in a series (Markov chains, autocomplete). None of them felt so much pressure to make an accurate prediction that they had no alternative but to teach themselves arithmetic! It seems that what is different about LLMs (as far as the post goes) is that we can anthropomorphize them.
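For illustration, here is a minimal sketch of the kind of Markov-chain predictor mentioned above. It assumes a bigram model over whitespace-separated tokens; the function names and the toy corpus are hypothetical and not from the thread or the article.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training sequence."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus (an assumption for this sketch): a few addition facts as tokens.
corpus = "2 + 2 = 4 . 3 + 3 = 6 . 2 + 2 = 4 .".split()
model = train_bigram(corpus)

# The model only knows which token most often followed "=" in training,
# so it gives the same answer no matter which sum came before it.
print(predict_next(model, "="))
```

A predictor like this conditions only on the previous token, so it can reproduce frequent patterns but has no way to represent the operands of a sum.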

More seriously, I guess I just feel like a meaningful sketch of an explanation for why algorithm X (where X is LLMs in this case) for continuing a piece of text is good at problem A should involve something about X and A. Because it is clearly highly dependent on the exact values of X and A, not just whether A can be posed as a text completion problem and humans would prefer the computer learn to solve the underlying problem to produce better text. For example, it could help to imagine a mechanism by which algorithm X could solve problem A. The closest thing to a mechanism (something algorithm X, i.e. LLMs, might be doing that's special) in the post is the talk of necessity being the mother of invention and "a deeper understanding of reality simplifies next-token prediction tasks," and the suggestion that if you were an LLM you might want to use "the rules of addition."

It's true that modeling arithmetic in some way could help an LLM account for known arithmetic problems in the training data, which could help it on unseen arithmetic problems, but what problems an LLM can solve is a function of what it can model. Anything an LLM can't model or can't do, it just doesn't. LLMs are really bad at chess, for example. The patterns of digits in addition may be similar enough to the hierarchical patterns in language the LLM is modeling. But it's not clear if the LLM is using the "rules of addition" or not. As far as I know, we don't actually understand why LLMs are able to store so much factual information, produce such coherent stories, and do the specific things they can do.

> what problems an LLM can solve is a function of what it can model.

Well said. The model that an LLM has is very simple: if text X precedes the current conversation, then the most likely continuation of the discussion is, according to the model held by the LLM, Y. Right?

So the point is that an LLM does not create models. It has only a single model based on probabilities of text sequences, created by its programmers. So it can (mostly?) only solve the problem of what would be a good textual response to an earlier text. It can do that well, but most difficult problems don't fall into that category of "having a great chat".

A lot of things that LLMs can already do reliably don't fall into the category of "having a great chat" either. Examples include retrieving data from external sources using commands (known as "plugins" in ChatGPT / Langchain) or writing working code to calculate information needed for answers or to create artifacts, such as charts.

Yes, all of this stems from the task of continuing text. However, more and more, this is veering into the category of behavior. I don't mean "conscious behavior," but "behavior" nevertheless. It's surprising, but it is also the reality in which we currently live.

What would be an example of a difficult problem?

Hmm, yea, I agree with you on several points. For one, we don't fully understand the internal mechanisms of LLMs. I'm also with you on Markov chains and autocomplete tools not having an understanding of the underlying concepts. They merely use statistical patterns in the data.

Based on what you've said, it sounds like your take is that unless we can specify the exact mechanism by which LLMs understand, we have no business saying that they understand. In a lot of cases, this is a reasonable approach. In many areas, if someone tells you X, and you ask for a mechanism of action, and they can't produce one, you have solid grounds for thinking they're bullshitting.

But this case isn't quite the same. We know that LLMs learn to represent their inputs in a high-dimensional vector space (embeddings) and learn the relationships between those vectors. We also see them effectively solve problems in a variety of domains using this representation. I think these two ingredients: having a semantic representation and being able to effectively solve problems amount to something like "understanding." The lack of both properties is why I'd say Markov chains and autocomplete tools don't "understand" -- they haven't learned an effective representation of the underlying phenomena. (I'd also argue this is similar to us as humans. We don't have a good understanding of the human brain or precise mechanisms of action underlying thought. All we know is we as humans have semantic representations and can effectively solve problems.)
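As a rough sketch of what "relationships between those vectors" can mean in practice, the snippet below compares toy embeddings with cosine similarity. The three-dimensional vectors and the word choices are made up for illustration; real model embeddings are learned and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-picked for this example (not from any real model).
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

royal = cosine_similarity(embeddings["king"], embeddings["queen"])
fruit = cosine_similarity(embeddings["king"], embeddings["banana"])

# Vectors that point in similar directions score closer to 1.
print(f"king vs queen:  {royal:.3f}")
print(f"king vs banana: {fruit:.3f}")
```

Whether proximity of this kind amounts to a "semantic" representation is exactly what the rest of the thread argues about; the code only shows the geometric operation.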

small note on your chess point: it now looks like ChatGPT 3.5 can achieve draws against Stockfish 8: https://marginalrevolution.com/marginalrevolution/2023/06/th...

bigger note on your chess point: this example illustrates that LLMs are "semi-decidable." We thought they were bad at chess, but we just hadn't discovered the right way to prompt. More generally, we can confirm when an LLM is good at X because we feed it a prompt that produces performance in X, but given the size of the input space we're dealing with here, we can't confirm that LLMs are bad at X just because we haven't seen them do well at it. Maybe we just haven't discovered the right prompt. (These input spaces are massive, by the way. ChatGPT-3.5, for example, has a context window of 4,096 tokens, so if we were considering only the English alphabet, we're looking at more than 26^{4,096} possible inputs.)
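To get a feel for the size of that input space, here is a small sketch of the arithmetic. It assumes, as the comment does, one of 26 letters per position and exactly 4,096 positions; the variable names are hypothetical. The digit count is computed with logarithms rather than by printing the number.

```python
import math

ALPHABET_SIZE = 26      # assumption: only lowercase English letters
CONTEXT_LENGTH = 4096   # context window quoted in the comment above

# Exact count of length-4096 strings over a 26-letter alphabet.
num_inputs = ALPHABET_SIZE ** CONTEXT_LENGTH

# Number of decimal digits, via floor(n * log10(b)) + 1.
num_digits = math.floor(CONTEXT_LENGTH * math.log10(ALPHABET_SIZE)) + 1

print(f"26^4096 has {num_digits} decimal digits")
```

The space is finite but far too large to enumerate, which is the point about not being able to rule out an undiscovered prompt by exhaustive search.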

Two points in response to this:

I think it's a category error to call word embedding in a vector space "semantic" representation when discussing concepts like understanding. Semantics deals with the referents of words, but in this case there are no referents, merely a list of representational tokens which are defined as being "close in meaning" to the original due to proximity in text or some other structural characteristic. We call the embedding "semantic" because it is useful for human semantic purposes as we can mechanize some translations from one vector to another and receive a useful response that we then assign meaning to, but that usefulness doesn't indicate that the machine itself has any access to the referents of the tokens it's processing or semantic understanding. Put more simply, "semantics" does not merely mean the relationship between several ungrounded tokens, but that is all a vector embedding can accomplish.

Secondly, I think in the chess thread, the prompt being "engineered" in the example is extremely complex and constrains the output space sufficiently to produce high-quality results, but you start to wonder at what point the LLM is no longer doing most of the work. Meanwhile, deeper in the thread we learn that even this prompting is not reliable and occasionally requires giving feedback that the move was bad(!) and repetition to achieve good results "the majority of the time in less than 3 tries". You can see where the practical problem arises if we want to rely on LLMs for answers we don't already know. Claiming that we have a "general" function that "just" requires arbitrarily varying the input over an astronomically large space until you achieve the desired result is akin to saying f(x) = rand() * x is a universal computer as long as you find the right x. The reductio ad absurdum version of the chess example is running Stockfish, sending a prompt that contains the Stockfish move and a request to repeat it, and then claiming that the LLM draws against Stockfish. However, as we have seen with tokens like "SolidGoldMagikarp", LLMs are not even yet capable of reliably implementing the identity function, so I am not sure we can even say this.