by nsagent 983 days ago
As a PhD student in NLP who's graduating soon, my perspective is that language models do not demonstrate "reasoning" in the way most people colloquially use the term.

These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.
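To make the point about context concrete, here is a minimal sketch of a greedy autoregressive loop. Everything in it is an assumption for illustration: `toy_model` is a hand-written rule table standing in for a language model, not a real LM, and the tokens and the "step-by-step" cue are made up. It only shows the mechanism: each step sees nothing but the context so far, so an intermediate fact helps only once it has been written into that context.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str], max_new_tokens: int) -> List[str]:
    """Greedy autoregressive loop: each step sees only the context so far."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == "<eos>":
            break
        context.append(token)  # the new token becomes visible to later steps
    return context


def toy_model(context: List[str]) -> str:
    """Hypothetical stand-in for a model: a hand-written rule table."""
    last = context[-1]
    if last == "A:":
        # With a step-by-step cue, emit the intermediate fact first;
        # without it, jump straight to a (wrong) guess.
        return "17+25=42," if "step-by-step" in context else "80"
    if last == "17+25=42,":
        return "84"  # the fact is now in context, so the last step can use it
    return "<eos>"


question = ["Q:", "(17+25)*2?"]
direct = generate(toy_model, question + ["A:"], 5)
cot = generate(toy_model, question + ["step-by-step", "A:"], 5)
```

In this toy, `direct` ends with the guess and `cot` ends with the intermediate fact followed by the answer that depends on it.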

Unfortunately, this means the "reasoning" exhibited by language models is limited: if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.

That said, I do think adding reasoning capabilities is an active area of research, but we don't have a clear time horizon on when that might happen. Current prompting approaches are stopgaps until research identifies a promising approach for developing reasoning, e.g. combining latent space representations with planning algorithms over knowledge bases, constraining the logits based on an external knowledge verifier, etc. (these are just random ideas; I'm not saying they are what people are working on, rather they are examples of possible approaches to the problem).
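As a rough illustration of what "constraining the logits based on an external knowledge verifier" could look like, here is a minimal sketch. The vocabulary, the logit values, and `toy_verifier` are all hypothetical; a real verifier and a real decoding stack would be far more involved. The idea shown is only the masking step: tokens the verifier rejects get a logit of negative infinity, so they receive zero probability after the softmax.

```python
import math
from typing import Callable, List


def constrain_logits(logits: List[float], allowed: List[bool]) -> List[float]:
    """Mask out rejected tokens by setting their logits to -inf."""
    if not any(allowed):
        raise ValueError("verifier rejected every candidate token")
    return [x if ok else -math.inf for x, ok in zip(logits, allowed)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_verifier(token: str) -> bool:
    """Hypothetical external check; here it simply rejects one token."""
    return token != "London"


# Made-up vocabulary and next-token logits for illustration only.
vocab = ["Paris", "London", "Berlin"]
logits = [1.0, 2.0, 0.5]

allowed = [toy_verifier(t) for t in vocab]
probs = softmax(constrain_logits(logits, allowed))
best = vocab[probs.index(max(probs))]
```

Without the mask the highest-logit token would be "London"; with it, the rejected token gets probability zero and the choice falls to the best remaining candidate.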

In my opinion, language models have been good enough since the GPT-2 era, but have been held back by a lack of reasoning and efficient memory. Making language models larger and training them on more data makes them more useful by incorporating more facts with increased computational capacity, but the approach is fundamentally a dead end for higher-level reasoning capability.

5 comments

Congrats on the upcoming PhD!

I'm curious where you are drawing your definition or scope for 'reasoning' from?

For example, in Shuren, The Neurology of Reasoning (2002), the definition selected was "the ability to draw conclusions from given information."

While I agree that LLMs can only process token to token and that juggling context is critical to effective operation such that CoT or ToT approaches are necessary to maximize the ability to synthesize conclusions, I'm not quite sure what the definition of reasoning you have in mind is such that these capabilities fall outside of it.

The typical lay audience suggestion that LLMs cannot generate new information or perspectives outside of the training data isn't the case, as I'm sure you're aware, and synthesizing new or original conclusions from input is very much within their capabilities.

Yes, this has to happen within a context window and occurs on a token by token basis, but that seems like a somewhat arbitrary distinction. Humans are unquestionably better at memory access and running multiple subprocesses on information than an LLM.

But if anything, this simply suggests that continuing to move in the direction of multiple pass processing of NLP tasks with selective contexts and a variety of fine tuned specializations of intermediate processing is where practical short term gains might lie.

As for the issue of new domains outside of training data, I'm somewhat surprised by your perspective. Hasn't one of the big research trends over the past twelve months been that in-context learning has proven more capable than was previously expected? I'd agree that a zero-shot evaluation of a problem type that isn't represented in an LLM's training data is setting it up for failure, but the capacity to extend in-context examples outside of training data has proven relatively more successful, no?

> These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.

Is it not possible that this is essentially how our brains do it too? Attempt to plan by branching out to related ideas until they contain an answer. Any of these statements that AI can't be on track to reason like a human because of X seem to come with an implication that we have such a good model of the human brain that we know it doesn't X. But I'm not an expert on neuroscience so in many of these cases maybe that implication is true.

>Is it not possible that this is essentially how our brains do it too?

Is that how you think? Just curious

I think the word "essentially" is important here. I don't think we can observe how we think. How it appears in consciousness is not necessarily real - it might be just a model constructed ex-post.

I do not know that much about AI but I know at least something about cognitive psychology, and it seems to me that a lot of claims about LLMs "not actually reasoning" and similar are probably made by CS graduates who have unexamined assumptions about how human thinking works.

I don't claim to know how human thinking works but if there is one thing I would conclude from studying psychology and knowing at least some basics about neuroscience, it would be that "it's not how it appears to us".

Nobody knows how human reasoning actually works but if I had to guess (based on my amateurish mental model of the functioning of the human brain), I would say that it is probably a lot closer to LLMs and a lot less rational than is commonly assumed in discussions like this one.

Maybe don't assume that PhD-level NLP researchers are out of touch on cognitive neuroscience topics related to language understanding. The latest research seems to indicate that language production and understanding exist separately from other forms of cognitive capacity. This includes people with global aphasia (no language ability) being able to do math, understand social situations, appreciate music, etc.

If you want to follow this more closely, I'd recommend the work of Evelina Fedorenko, a cognitive neuroscientist at MIT who specializes in language understanding.

Check out these talks for more details: https://youtu.be/TsoQFZxrv-I?t=580 https://youtu.be/qublpBRtN_w

What this means in the context of LLMs is that next word prediction alone does not provide the breadth of cognitive capacity humans exhibit. Again, I'd posit GPT-2 is plenty capable as an LM, if combined with an approach to perform higher-level reasoning to guide language generation. Unfortunately, what that system is and how to design it currently eludes us.

First, you are right that I should not assume anyone's knowledge (or lack thereof). It just popped into my mind as something that could explain the thing that's been puzzling me for months: what are people talking about when they say that LLMs are not actually reasoning, or Stable Diffusion is not actually creating? I wish I had not included that assumption and had been inquisitive instead. Let me try again.

Maybe I diverted your focus the wrong way when I used LLMs as an example - what if I used the more general term "neural network"? I said LLMs because this thread is about LLMs but let me clarify what I meant:

The thing that interests me in this thread is the claim that LLMs are "not capable of actually reasoning". Whether you agree with it depends on your mental model of actual reasoning, right?

My model of reasoning: the fundamental thing about it is that I have a network of things. The signal travels through the network guided by the weight of connections between them and fires some pattern of the things. That pattern represents something. Maybe it is a word in the case of LLMs (or syllable or whatever the token actually is - let's ignore those details for now) or a thought in the case of my brain (I was not saying people reason in language) - the resulting "token" can be many things, I imagine (like some mental representation of objects and their positions in spatial reasoning) - those are the specifics, but "essentially", the underlying mechanism is the same.

In my mental model, there is nothing fundamental that distinguishes what LLMs do from "actual reasoning". If you have enough compute and good enough training data, you can create an LLM that reasons as well as humans - that is my default hypothesis.

If I understand your position, you would not agree with that, correct? I am not claiming you are wrong - I know way too little for that. I would just be really curious - what is your mental model of actual reasoning? What does it have that LLMs do not have?

I know you mentioned that "these models have no capacity to plan ahead" - I am not sure I understand what you mean by that. Is this not just a matter of training?

BTW, I have talked about this topic before and some people apparently see consciousness as a necessary part of actual reasoning. I do not - do you?

I don’t think we are conscious of how the language center correlates with our memories and then predicts the strings of words coming out.

> if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.

True. But look at the Phi-1.5 model - it punches 5x above its weight. The trick is in the dataset:

> Our training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity. We point out that the only non-synthetic part in our training data for phi-1.5 consists of the 6B tokens of filtered code dataset used in phi-1’s training (see [GZA+ 23]).

> We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.

https://arxiv.org/pdf/2309.05463.pdf
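For a sense of scale, the token counts in the quoted passage can be turned into a quick back-of-the-envelope calculation. The figures are the rounded ones from the quote ("7B", "roughly 20B", "6B"), so the results are approximate; the variable names are just illustrative.

```python
# Token counts in billions, as stated in the quoted passage (rounded figures).
phi1_data_b = 7.0        # phi-1's training data
new_synthetic_b = 20.0   # newly created synthetic "textbook-like" data (roughly)
filtered_code_b = 6.0    # the only non-synthetic part (filtered code)

total_b = phi1_data_b + new_synthetic_b          # about 27B tokens overall
non_synthetic_share = filtered_code_b / total_b  # about 0.22
synthetic_share = 1.0 - non_synthetic_share      # about 0.78

print(f"total: ~{total_b:.0f}B tokens")
print(f"non-synthetic: ~{non_synthetic_share:.0%}, synthetic: ~{synthetic_share:.0%}")
```

On these rounded numbers, a bit under a quarter of the training tokens are non-synthetic.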

Synthetic data has its advantages: less bias, more diversity, scalability, higher average quality. But more importantly, it can cover all the permutations and combinations of skills, concepts, and situations. That's why a small model like Phi-1.5, at just 1.3B parameters, was able to perform like a 7B model. Models at that scale are usually not coherent.

Are you going to school in Langley, Virginia?

NSA is more commonly associated with Fort Meade, MD, for what that's worth.

> These models have no capacity to plan ahead

How would you describe the behavior of "GPT Advanced Data Analysis"?