by rob74 983 days ago
The article doesn't say that LLMs aren't useful - the "hype" they mean is overestimating their capabilities. An LLM may be able to pass a "theory of mind" test, or it may fail spectacularly, depending on how you prompt it. And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future, but we're not there yet, and (AFAIK) nobody can tell how long it will take to get there.
7 comments

> And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future [...]

I don't think so. When you say "it's not capable of actually reasoning", that's because it's an LLM; and if it "changes in the future", that's because the new system must no longer be a pure LLM. The appearance of reasoning in LLMs is an illusion.

How is the illusion of reasoning different from say “actual reasoning?”
Because it literally can't reason, and it also has no innate agency. Even the most dedicated creators of LLM-based AI technology have clearly and repeatedly stated that these are very sophisticated stochastic parrots with no sense of self. How much easier could it be to see that LLMs like GPT aren't actual thinking machines in the way we humans are?

Yes, many people reason based on pure pattern-matching and repeat opinions not because they've reasoned them but because they're what they've absorbed from other sources, but even the world's most unreasoned human being with at least functional cognition still uses an enormous amount of constant, daily, hourly self-directed decision-making for a vast variety of complex and simple, often completely spontaneous scenarios and tasks in ways that no machine we've yet built on Earth does or could.

Moreover, even when some humans say or "believe" things based on nothing more than what they've absorbed from others without really considering it in depth, they almost always do so in a particularly selective way that fits their cognitive, emotional and personal predispositions. This very selectiveness is a distinctly conscious trait of a self-aware being. It's something LLMs don't have as far as I've yet seen.

In the same way that illusions of anything else differ from the real thing. A wax apple is different from a real apple, even if it's hard to tell them apart sometimes. You may require further investigation to differentiate them (e.g., cutting open the apple or asking the AI to solve tricky reasoning questions), but if you can find a difference, there is a difference.
I have a hunch I am misunderstanding your argument, but does that mean the only way to build a "true reasoning machine" would be to just create a human?

I guess what I'm really asking, what would you expect to observe to make it not illusory?

To distinguish between "is an illusion" and "is not an illusion", you need evidence that isn't observational. The whole point of illusions is that observational evidence is unreliable.

A desert mirage in the distance is an illusion; to the observer, it's indistinguishable from an oasis. You can only tell that it's a mirage by investigating how the appearance was created (e.g. by dragging your thirsty ass through the sand, to the place where the oasis appeared to be).

If one has a reasonable understanding of two concepts that make up a larger system, and that system has little else in addition to those concepts, one should be able to come up with that system by oneself, even if one has never seen it and its composition was never explained prior to that logical process.

The illusion happens when, clearly, the alleged reasoning behind how such a system comes to be is based on prior knowledge of the system as a whole. Meaning, its construction/source was within the training data.

That sounds like a good litmus test. Do you have a specific example you've tried?

My opinion is it isn't binary, rather it's a scale. Your example is a point on the scale higher than what it is now.

But perhaps that's too liberal a definition of "reasoning", no idea.

We seem to move the goalposts on what constitutes human-level intelligence as we discover the various capabilities exhibited in the animal kingdom. I wonder if it is/will be the same with AI.

I'm really curious, are you able to demonstrate reasoning, not reasoning and the illusion of reasoning in a toy example? I'd like to see what each looks like.
Have you met someone who is full of bullshit? They sound REALLY convincing, except if you know anything about the subject, their statements are just word salad?
Have you met someone who's good at bullshitting their way out of a tough spot? There may be a word salad involved, but preparing it takes some serious skill and brainpower, and perhaps a decent high-level understanding of a domain. At some point, the word salad stops being a chain of words, and becomes a product of strong reasoning - reasoning on the go, aimed at navigating a sticky situation, but reasoning nonetheless.
Are you able to give some examples? I'd like to know what it looks like w.r.t. LLMs.
Bullshit has an illusion of reasoning instead of actual reasoning. Basically you give arguments that sound reasonable on the surface but there is no actual reasoning behind them.
> Bullshit has an illusion of reasoning instead of actual reasoning.

Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning? You could argue that bullshit is fallacious reasoning, "pseudo-reasoning" based on incorrect rules of inference.

But these models don't use any rules of inference; they produce output that resembles the result of reasoning, but without reasoning. They are trained on text samples that presumably usually are the result of human reasoning. If you trained them on bullshit, they'd produce output that resembled fallacious reasoning.

No, I don't think the touchstone for actual reasoning is a human mind. There are machines that do authentic reasoning (e.g. expert systems), but LLMs are not such machines.

> Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning?

None in principle, at least if you take the common definition of bullshit as saying things for effect, without caring whether they're true or false.

Fallacious reasoning will make you wrong. No reasoning will make you spew nonsense. Truth and lies and bullshit, all require reasoning for the structure of what you're saying to make sense, otherwise it devolves to nonsense.

> But these models don't use any rules of inference

Neither do we. Rules of inference came from observation. Formal reasoning is a tool we can employ to do better, but it's not what we naturally do.

> None in principle, at least if you take the common definition of bullshit as saying things for effect, without caring whether they're true or false.

Maybe splitting hairs, but I’d argue that the bullshitter is reasoning about what sounds good, and what sounds good needs at least some shared assumptions and resulting logical conclusion to hang its hat on. Maybe not always, but enough of the time that I would still consider reasoning to be a key component of effective bullshit.

That's not the case. It's very much in the realm of "we don't know what's going on in the network."

Rather than a binary it's much more likely that it's a mix of factors going into results that includes basic reasoning capabilities developed from the training data (much like board representations and state tracking abilities developed feeding board game moves into a toy model in Othello-GPT) as well as statistic driven autocomplete.

In fact often when I've seen GPT-4 get hung up with logic puzzle variations such as transparency, it tends to seem more like the latter is overriding the former, and changing up tokens to emoji representations or having it always repeat adjectives attached to nouns so it preserves variation context gets it over the hump to reproducible solutions (as would be expected from a network capable of reasoning) but by default it falls into the pattern of the normative cases.

For something as complex as SotA neural networks, binary sweeping statements seem rather unlikely to actually be representative...

As a PhD student in NLP who's graduating soon, my perspective is that language models do not demonstrate "reasoning" in the way most people colloquially use the term.

These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.

Unfortunately, this means the "reasoning" exhibited by language models is limited: if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.

That said, I do think adding reasoning capabilities is an active area of research, but we don't have a clear time horizon on when that might happen. Current prompting approaches are stopgaps until research identifies a promising approach for developing reasoning, e.g. combining latent space representations with planning algorithms over knowledge bases, constraining the logits based on an external knowledge verifier, etc (these are just random ideas, not saying they are what people are working on, rather are examples of possible approaches to the problem).

In my opinion, language models have been good enough since the GPT-2 era, but have been held back by a lack of reasoning and efficient memory. Making the language models larger and trained on more data helps make them more useful by incorporating more facts with increased computational capacity, but the approach is fundamentally a dead end for higher level reasoning capability.

Congrats on the upcoming PhD!

I'm curious where you are drawing your definition or scope for 'reasoning' from?

For example, in Shuren, The Neurology of Reasoning (2002), the definition selected was "the ability to draw conclusions from given information."

While I agree that LLMs can only process token to token and that juggling context is critical to effective operation such that CoT or ToT approaches are necessary to maximize the ability to synthesize conclusions, I'm not quite sure what the definition of reasoning you have in mind is such that these capabilities fall outside of it.

The typical lay audience suggestion that LLMs cannot generate new information or perspectives outside of the training data isn't the case, as I'm sure you're aware, and synthesizing new or original conclusions from input is very much within their capabilities.

Yes, this has to happen within a context window and occurs on a token by token basis, but that seems like a somewhat arbitrary distinction. Humans are unquestionably better at memory access and running multiple subprocesses on information than an LLM.

But if anything, this simply suggests that continuing to move in the direction of multiple pass processing of NLP tasks with selective contexts and a variety of fine tuned specializations of intermediate processing is where practical short term gains might lie.

As for the issue of new domains outside of training data, I'm somewhat surprised by your perspective. Hasn't one of the big research trends over the past twelve months been that in-context learning has proven more capable than was previously expected? I'd agree that a zero-shot evaluation of a problem type that isn't represented in an LLM's training data is setting it up for failure, but the capacity to extend in-context examples outside of training data has proven relatively more successful, no?

> These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.

Is it not possible that this is essentially how our brains do it too? Attempt to plan by branching out to related ideas until they contain an answer. Any of these statements that AI can't be on track to reason like a human because of X seem to come with an implication that we have such a good model of the human brain that we know it doesn't X. But I'm not an expert on neuroscience so in many of these cases maybe that implication is true.

>Is it not possible that this is essentially how our brains do it too?

Is that how you think? Just curious

I think the word "essentially" is important here. I don't think we can observe how we think. How it appears in consciousness is not necessarily real - it might be just a model constructed ex-post.

I do not know that much about AI but I know at least something about cognitive psychology and it seems to me that a lot of claims about LLMs "not actually reasoning" and similar are probably made by CS graduates who have unreflected assumptions about how human thinking works.

I don't claim to know how human thinking works but if there is one thing I would conclude from studying psychology and knowing at least some basics about neuroscience, it would be that "it's not how it appears to us".

Nobody knows how human reasoning actually works but if I had to guess (based on my amateurish mental model of the functioning of the human brain), I would say that it is probably a lot closer to LLMs and a lot less rational than is commonly assumed in discussions like this one.

Maybe don't assume that PhD-level NLP researchers are out of touch on cognitive neuroscience topics related to language understanding. The latest research seems to indicate that language production and understanding exist separately from other forms of cognitive capacity. This includes people with global aphasia (no language ability) being able to do math, understand social situations, appreciate music, etc.

If you want to follow this more closely, I'd recommend the work of Evelina Fedorenko, a cognitive neuroscientist at MIT who specializes in language understanding.

Check out these talks for more details: https://youtu.be/TsoQFZxrv-I?t=580 https://youtu.be/qublpBRtN_w

What this means in the context of LLMs is that next word prediction alone does not provide the breadth of cognitive capacity humans exhibit. Again, I'd posit GPT-2 is plenty capable as an LM, if combined with an approach to perform higher-level reasoning to guide language generation. Unfortunately, what that system is and how to design it currently eludes us.

I don’t think we are conscious about how the language center correlates with our memories and then predicts the strings of words coming out.
> if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.

True. But look at the Phi-1.5 model - it punches 5x above its weight limit. The trick is in the dataset:

> Our training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity. We point out that the only non-synthetic part in our training data for phi-1.5 consists of the 6B tokens of filtered code dataset used in phi-1’s training (see [GZA+ 23]).

> We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.

https://arxiv.org/pdf/2309.05463.pdf

Synthetic data has its advantages - less bias, more diverse, scalable, higher average quality. But more importantly, it can cover all the permutations and combinations of skills, concepts, situations. That's why a small model just 1.5B like Phi was able to work like a 7B model. Usually at that scale they are not coherent.

Are you going to school in Langley, Virginia?
NSA is more commonly associated with Fort Meade, MD, for what that's worth.
> These models have no capacity to plan ahead

How would you describe the behavior of "GPT Advanced Data Analysis"?

> it's not capable of actually reasoning

Define reasoning. Because in my definition GPT 4 could reason without doubt. It definitely can't reason better than experts in the field, but it could reason better than say interns.

I don't have access to GPT 4 but I'd be interested to see how it does on a question like this:

"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."

... because on GPT 3.5 the answer begins like the below and then gets worse:

"Let's break down the process step by step:

Initially, you have 50 red balls and 50 blue balls in the container.

1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."

If I was hiring interns this dumb, I'd be in trouble.

EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.

This is such a flawed puzzle. And GPT 4 answers it rightly. It is a long answer but the last sentence is "This is one possible scenario. However, there could be other scenarios based on the order in which balls are drawn. But in any case, the same logic can be applied to find the number of each color of ball left in the container."
The ability to identify that there isn't a simple closed form result is actually a key component of reasoning. Can you stick the answer it gives on a gist or something? The GPT 3.5 response is pure, self-contradictory word salad and of course delivered in a highly confident tone.
> The ability to identify that there isn't a simple closed form result is actually a key component of reasoning.

If that's the case, then most humans alive would fail to meet this threshold. Finding a general solution to a specific problem, and identifying whether or not there exists a closed-form solution, and even knowing these terms, are skills you're taught in higher education, and even the people who went through it are prone to forget all this unless they're applying those skills regularly in their life, which is a function of specific occupations.

https://pastebin.com/r9bNi8GD

GPT 4 goes into detail about one example scenario, which most humans won't do, but it is a technically correct answer as it said it depends on the order.

Its answer isn't correct, this isn't a possible ending scenario:

- *Ending Scenario:* - Red Balls (RB): 0 (all have been drawn) - Blue Balls (BB): 50 (none have been drawn) - White Balls (WB): 0 (since no blue balls were drawn, no white balls were added) - Total Balls: 50
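A claimed ending scenario like the one quoted above can be checked mechanically. The following is only a sketch under my own reading of the puzzle (50 red and 50 blue to start, exactly 100 draws, two white balls added per blue ball drawn); the function name `is_reachable` and its parameters are made up for illustration. It recovers how many balls of each color must have been drawn and checks that the draws add up to 100.

```python
def is_reachable(red_left, blue_left, white_left,
                 red=50, blue=50, draws=100, whites_per_blue=2):
    """Return True if the claimed end state can occur after exactly `draws` draws.

    Assumption (not stated in the thread): any order of draws is allowed, so a
    state is reachable whenever the implied draw counts are non-negative and
    sum to `draws` (draw the blues first, then the reds, then the whites).
    """
    if min(red_left, blue_left, white_left) < 0:
        return False
    blue_drawn = blue - blue_left
    red_drawn = red - red_left
    # Every blue draw adds whites; whatever is not left over was drawn.
    white_drawn = whites_per_blue * blue_drawn - white_left
    if min(blue_drawn, red_drawn, white_drawn) < 0:
        return False
    return blue_drawn + red_drawn + white_drawn == draws


if __name__ == "__main__":
    # The quoted scenario: 0 red, 50 blue, 0 white left.
    print(is_reachable(0, 50, 0))    # False: it only accounts for 50 draws
    # All blue drawn first, then all red: 100 white left.
    print(is_reachable(0, 0, 100))   # True
```

Under these assumptions the quoted scenario fails because it only accounts for 50 of the 100 draws.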

> but it is technically correct answer as it said it depends on the order.

It should give you pause that you had to pick not only the line by which to judge the answer but the part of the line. The sentence immediately before that is objectively wrong:

> This is one possible scenario.

But the reasoning is total garbage, right?

It says the number of blue balls drawn is x and the number of red balls drawn is y, and then asserts x + y = 100, which is wrong.

Then it proceeds to "solve" an equation which reduces to x = x to conclude x = 0.

It then uses that to "prove" that y = 100, which is a problem as there are only 50 red balls in the container and nothing causes any more to be added.

It's like "mistakes bad students make in Algebra 1".
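For reference, here is the bookkeeping that the draw counts have to satisfy, written out under my own notation (x blue, y red, z white balls drawn, which extends the x and y used above with a z for white draws). It is a sketch of the constraint, not a full solution to the puzzle.

```latex
\begin{align*}
  x + y + z &= 100, \qquad 0 \le x \le 50,\quad 0 \le y \le 50,\quad 0 \le z \le 2x \\
  \text{blue left}  &= 50 - x \\
  \text{red left}   &= 50 - y \\
  \text{white left} &= 2x - z \\
  \text{total left} &= (50 - x) + (50 - y) + (2x - z) = 100 + x - (y + z) = 2x
\end{align*}
```

The equation x + y = 100 only holds when z = 0, that is, when no white ball is ever drawn, and in that case x = y = 50.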

I asked GPT4 and it gave a similar response. So then I asked my wife and she said, "do you want more white balls at the end or not?" And I realized that as a CS or math question we assume that the draw is random. Other people assume that you're picking which ball to draw.

So I clarified to ChatGPT that the drawing is random. And it replied: "The exact numbers can vary based on the randomness and can be precisely modeled with a simulation or detailed probabilistic analysis."

I asked for a detailed probabilistic analysis and it gives a very simplified analysis. And then basically says that a Monte Carlo approach would be easier. That actually sounds more like most people I know than most people I know. :-)
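For anyone who wants to try the Monte Carlo route, a minimal Python sketch could look like the following. It assumes each draw is uniformly random over the balls currently in the container, which is one reading of the puzzle and not something the original question states. The function names `simulate_once` and `monte_carlo` are hypothetical, and this is not the code ChatGPT produced.

```python
import random


def simulate_once(rng, red=50, blue=50, draws=100, whites_per_blue=2):
    """Run one random sequence of draws and return the counts left in the container."""
    counts = {"red": red, "blue": blue, "white": 0}
    for _ in range(draws):
        colors = list(counts)
        weights = [counts[c] for c in colors]
        if sum(weights) == 0:
            break  # container is empty, nothing left to draw
        # Pick a color with probability proportional to how many balls it has.
        color = rng.choices(colors, weights=weights)[0]
        counts[color] -= 1
        if color == "blue":
            counts["white"] += whites_per_blue
    return counts


def monte_carlo(trials=10_000, seed=0):
    """Average the end-state counts over many independent runs."""
    rng = random.Random(seed)
    totals = {"red": 0, "blue": 0, "white": 0}
    for _ in range(trials):
        result = simulate_once(rng)
        for color in totals:
            totals[color] += result[color]
    return {color: totals[color] / trials for color in totals}


if __name__ == "__main__":
    print(monte_carlo())
```

The averages it prints are estimates of the expected counts only; individual runs can end in quite different states.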

I don't understand the question. Surely the answer depends which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected value of each number of ball at the end of the process?

Seems like quite a difficult question to compute exactly.

I reworded the question to make it clearer and then it was able to simulate a bunch of scenarios as a Monte Carlo simulation. Was your hope to calculate it exactly with dynamic programming? GPT-4 was not able to do this, but I suspect neither could a lot of your interns.
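The exact dynamic-programming calculation is not too bad to write by hand. The sketch below assumes uniformly random draws (an interpretation, not something the puzzle states) and tracks the probability of every (red, blue, white) state after each draw. `exact_distribution` and `expected_counts` are names I made up for illustration, and floats are used for brevity where exact fractions would also work.

```python
from collections import defaultdict


def exact_distribution(red=50, blue=50, draws=100, whites_per_blue=2):
    """Return {(red_left, blue_left, white_left): probability} after `draws` draws."""
    dist = {(red, blue, 0): 1.0}
    for _ in range(draws):
        nxt = defaultdict(float)
        for (r, b, w), p in dist.items():
            total = r + b + w
            if total == 0:
                nxt[(r, b, w)] += p  # empty container: the state cannot change
                continue
            if r:
                nxt[(r - 1, b, w)] += p * r / total
            if b:
                # Drawing a blue ball puts `whites_per_blue` white balls back.
                nxt[(r, b - 1, w + whites_per_blue)] += p * b / total
            if w:
                nxt[(r, b, w - 1)] += p * w / total
        dist = dict(nxt)
    return dist


def expected_counts(dist):
    """Probability-weighted average of each color over the final states."""
    exp = {"red": 0.0, "blue": 0.0, "white": 0.0}
    for (r, b, w), p in dist.items():
        exp["red"] += p * r
        exp["blue"] += p * b
        exp["white"] += p * w
    return exp


if __name__ == "__main__":
    final = exact_distribution()
    print(len(final), "distinct end states")
    print(expected_counts(final))
```

A tiny case is easy to check by hand: with 1 red, 1 blue and 2 draws, the end state is (0, 0, 2) with probability 2/3 and (1, 0, 1) with probability 1/3.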

>I don't understand the question. Surely the answer depends which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected value of each number of ball at the end of the process?

These are very good questions that anyone with the ability to reason would ask if given this problem.

"You're holding it wrong."

You're asking GPT to do maths in its head, the AI equivalent of a person standing in the middle of the room with no tools and getting grilled in an oral examination of their knowledge.

Instead, collaborate with it, while giving it the appropriate tools to help you.

I asked it to write a Monte Carlo simulation of the problem in Wolfram Mathematica script. It did this about 10-100x faster than I would have been able to. It made a few small mistakes with the final visualisation, but I managed to get it to output a volumetric plot showing the 3D scatter plot of the histogram of possible outcomes.

I even got it to save a video of the plot rotating: https://streamable.com/2aphbz

AI can reason! Just not reasonably!
It can reason better than most humans put into the same situation.

This problem doesn't result in a constant value, it results in a 3D probability distribution! Very, very few humans could work that out without tools. (I'm including pencil and paper in "tools" here.)

With only a tiny bit of coaxing, GPT 4 produced an animated video of the solution!

Try to guess what fraction of the general population could do that at all. Also try to estimate what fraction of general software developers could solve it in under an hour.

A human could get a valid end state most of the time, gpt-4 seems to mess up more than it got it right based on the examples posted here. So to me it seems like gpt-4 is worse than humans.

Gpt-4 with help from a competent human will of course do better than most humans, but that isn't what we are discussing.

>It can reason better than most humans put into the same situation.

On what basis do you allege this? People say the most unhinged stuff here about AI, and it so often goes completely unchallenged. This is a huge assertion that you are making.

This is what I got on a basically brand new OpenAI account: https://chat.openai.com/share/5199c972-478d-406f-9092-061a6b...

All told, I'd say it's a decent answer.

Edit: I took it to completion: https://chat.openai.com/c/6cdd92f1-487a-4e1c-ab94-f2bdbf282d...

These were the first responses each time, with no massaging/retries/leading answers. I will say it's not entirely there. I re-ran the initial question a few times afterwards and one was basically gibberish.

GPT-4:

e composition of what's left in the container.

There's a couple of scenarios, which depend on when you run out of blue balls:

1) If you draw all 50 blue balls first, for every blue ball you draw, you are adding 2 white balls. Hence, once you've drawn all the blue balls, you have replaced them with 100 white balls. This means you've drawn 50 balls total and there's now 100 white balls and 50 red balls (150 balls) in the container. Then, you continue and draw the next 50 balls. These could be any combination of white and red balls, we don't know which color is preferred when you continue drawing.

2a) If you draw all red balls in these next 50 draws, you would have 50 red balls out and 50 blue balls out. So, you would end with 100 white balls in the container.

2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.

2c) If you draw a mixture of red and white balls in these next 50 draws, the number leftover in the container would vary according to the specific combination drawn.

Remember, the order in which the balls are drawn matters to this question since we have a rule in place that changes the overall total (or population) of balls within the container. In each of these scenarios, you've drawn 50 times initially (all blue) and then 50 times again (in some unknown combination of red and white). You've drawn 100 times total and changed the number of white balls in the container from zero to an amount dependent on how many times you drew a white ball on your second round of 50 draws.

Yeah, that's still pretty much nonsense isn't it?

2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.

... so after removing 100 balls, I've removed 150 balls? And the 150 balls that I've removed are red, white and blue despite the fact that I removed 50 blue balls initially and then 50 white ones.

Just because it fails one test in a particular way doesn’t mean it lacks reasoning entirely. It clearly does have reasoning based on all the benchmarks it passes.

You are really trying to make it not have reasoning for your own benefit

> You are really trying to make it not have reasoning for your own benefit

This whole thread really seems like it's the other way around. It's still very easy to make ChatGPT spit out obviously wrong answers depending on the prompt. If it had actual ability to reason as opposed to just generating a continuation of your prompt, the quality of the prompt wouldn't matter as much.

GPT 3.5 is VERY dumb when compared to GPT 4. Like, the difference is massive.
GPT 4 still does a lot of dumb stuff on this question; you see several people post outright wrong answers and say "Look how gpt-4 solved it!". That happens quite a lot in these discussions, so it seems like the magic to get gpt-4 to work is that you just don't check its answers properly.
It's still a tool after all.

I've had to work with imperfect machines a lot in my recent past. Just because sometimes it breaks, doesn't mean it's useless. But you do have to keep your eyes on the ball!

> It's still a tool after all.

I think that's the crux of the whole argument. It's an imperfect (but useful) tool, which sometimes produces answers that make it seem like it can reason, but it clearly can't reason on its own in any meaningful way

A smart hammer that sometimes unavoidably hits your thumb. How smart!
I ran this through GPT-4 Advanced Data Analytics version: https://chat.openai.com/share/b84feb03-22ed-4231-be41-cdb725...

Seems like it reasons its way to this answer at the end to me: Mind you, while averages are insightful, they don't capture the delightful unpredictability of each individual run. Would you like to explore this delightful chaos further, or shall we move on to other intellectual pursuits?

https://chat.openai.com/share/a9806bd1-e5a9-4fea-981b-2843e6...

Took a bit of massaging and I enabled the Data Analysis plugin which lets it write Python code and run it. It looks like the simulation code is correct though.

>Let's assume you draw x blue balls in 100 draws. Then you would have drawn 100−x red balls.

Uhm.

I came at it from a different angle. The simulation code in my case had a bug which I needed to point out. Then it got a similar final answer.
It's not reasoning. It's word prediction. At least at the individual model level. OpenAI is likely using a collection of models.
ChatGPT is trained on text that includes most reasoning problems that people come up with.

You see reasoning issues when you use more real world examples, rather than theoretical tests.

I had 4 failure states.

1) Summarization: It summarized 3 transcripts correctly, for the fourth it described the speaker as a successful VC. The speaker was a professor.

2) It was to act as a classifier, with a short list of labels. Depending on the length of text, the classifier would swap over to text gen. Other issues included novel labels, new variations of labels, and so on.

3) Agents - This died on the vine. Leave having to learn asynch, vector DBs or whatever. You can never trust the output of an LLM, so you can never chain agents.

4) I focused on using ChatGPT to complete a project. I hadn't touched HTML ever - the goal was to use ChatGPT to build the site. This would cover design, content, structure, development, hosting, and improvements.

I still have trauma. Wrong code, bad design, were base issues. If code was correct, it simply meant I had dug a deeper grave. I had anticipated 70% of the work being handled by ChatGPT, it ended up at 30% at the most.

ChatGPT is great IF you already are a subject expert - you can brush over the issues and move on.

"Hallucinations" is the little bit of string that you pull on, and the rest unravels. There are no hallucinations, only humans can hallucinate - because we have an actual ground truth to work with.

LLMs are only creating the next token. For them to reason, they must be holding structures and proxies in some data store, and actively altering it.

It's easier to see once you deal with hallucinations.

What is your definition?
If it can solve basic logic problems, then it could reason. And if it could write code of a new game with new logic, then it could reason for sure.

Example of basic problem: In a shop, there are 4 dolls of different heights P,Q,R and S. S is neither as tall as P nor as short as R. Q is shorter than S but taller than R. If Kittu wants to purchase the tallest doll, which one should she purchase? Think step by step.
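For comparison, the doll puzzle is small enough to brute-force. The sketch below reads "S is neither as tall as P nor as short as R" as R < S < P and "Q is shorter than S but taller than R" as R < Q < S, which is my interpretation of the wording. The function name `tallest_doll` is hypothetical. It tries every height ordering of the four dolls and collects the tallest doll from each ordering that fits.

```python
from itertools import permutations


def tallest_doll():
    """Return the set of dolls that can be tallest under the puzzle's constraints."""
    candidates = set()
    # Each permutation lists the dolls from shortest to tallest.
    for order in permutations("PQRS"):
        height = {doll: rank for rank, doll in enumerate(order)}
        s_between_r_and_p = height["R"] < height["S"] < height["P"]
        q_between_r_and_s = height["R"] < height["Q"] < height["S"]
        if s_between_r_and_p and q_between_r_and_s:
            candidates.add(order[-1])
    return candidates


if __name__ == "__main__":
    print(tallest_doll())  # {'P'}
```

Only the ordering R < Q < S < P satisfies both constraints, so P is the tallest doll.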

Really?
> And that's because, despite all of its training data, it's not capable of actually reasoning.

Your conclusion doesn't follow from your premise.

None of these models are trained to do their best on any kind of test. They're just trained to predict the next word. The fact that they do well at all on tests they haven't seen is miraculous, and demonstrates something very akin to reasoning. Imagine how they might do if you actually trained them or something like them to do well on tests, using something like RL.

> None of these models are trained to do their best on any kind of test

How do you know GPT-4 wasn't trained to do well on these tests? They didn't disclose what they did for it, so you can't say it wasn't trained to do well on these tests. That could be the magic sauce for it.

They are trained to predict next tokens in a stream.

That is the learning algorithm.

The algorithm they learn in response is quite different, since that learned algorithm is shaped by the training data.

In this case the models learn to sensibly continue text or conversations. And they are doing it so well it’s clear they have learned to “reason” at an astonishing level.

Sometimes not as well as a human.

But in a tremendous number of ways they are better.

Try writing an essay about the many-worlds interpretation of the quantum field equation, from the perspective of Schrödinger, with references to his personal experiences, using analogies with medical situations, formatted as a brief for the Supreme Court, in Dr. Seuss prose, in a random human language of choice.

In real time.

While these models have some trouble with long chains of reasoning, and with reasoning about things they have no experience of (different modalities, although sometimes they are surprisingly good), it is clear that they can also reason by combining complex information drawn from their whole knowledge base much faster and more sensibly than any human has ever come close to.

Where they exceed us, they trounce us.

And where they don’t, it’s amazing how fast they are improving. Especially given that year to year, biological human capabilities are at a relative standstill.

——

EDIT: I just tried the above test. The result was wonderful, whimsical prose and references that made sense at a very basic level, and that a Supreme Court of 8-year-olds would likely enjoy, especially if served along with some Dr. Seuss art! In about 10-15 seconds.

Viewed as a solution to an extremely complex constraint problem, that is simply amazing. And far beyond human capabilities on this dimension.

You are right that the process involves predicting words from training data. But you can still make training data focused on passing these tests. Adding millions of test questions to the training data to optimize for answering test questions is perfectly doable when you have the resources OpenAI has.

A strong hint as to what they focused on in their training process is which metrics they used in their marketing of the model. You should always bet on models being optimized to perform on whatever metrics the vendors themselves give you when they market the model. Look at the gpt-4 announcement: what metrics did they market? So what metrics should we expect they optimized the model for?

Exam results are the first metric they mention, so exams were probably one of their top priorities when they trained gpt-4.

https://openai.com/research/gpt-4

Yes, absolutely. They can adjust performance priorities.

They do so through the relative mix of training data, additional fine-tuning phases, and/or pre-prompts that give the model extra guidance for particular task types.

>The fact that they do well at all on tests they haven't seen

Haven't they seen these tests?

We know little to nothing of how these models get trained.

LLMs are trained to predict text, and one of the results of this is that the LLM has as many "faces" as exist in the training data, so it's going to be _very_ different depending on the prompt. It's not a consistent entity like a human. RLHF is an attempt to mitigate this, but it doesn't work perfectly.

I’m often confused by claims about reasoning capabilities. The lack of them is often mentioned in debates as a clear and undeniable issue with current LLMs. So since this claim can be made, where are said tests of reasoning skills that GPT-4 fails?

If it’s a debate on the illusion of reasoning, I’d be careful how I step here, because it’s been found these things probably work so well because the human brain is also a biological real-time prediction machine and “just” guessing too: https://www.scientificamerican.com/article/the-brain-guesses...

Isn't that the same as for humans? If you are speaking with me (prompting), my answers will be different, based on how you prompted me for an answer.