by mg 983 days ago
I don't think the "hype" is built on test scores.

It is built on the observation of how fast AI is getting better. If the speed of improvement stays anywhere near the level of the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.

Just two years ago, I was mesmerized by GPT-3's ability to understand concepts:

https://twitter.com/marekgibney/status/1403414210642649092

Nowadays, using it daily in a productive fashion feels completely normal.

Yesterday, I was annoyed with how cumbersome it is to play long mp3s on my iPad. I asked GPT-4 something like "Write an html page which lets me select an mp3, play it via play/pause buttons and offers me a field to enter a time to jump to". And the result was usable out of the box and is my default mp3 player now.

Two years ago it didn't even dawn on me that this would be my way of writing software in the near future. I have been coding for over 20 years. But for little tools like this, it is faster to ask ChatGPT now.

It's hard to imagine where we will be in 20 years.

12 comments

The article doesn't say that LLMs aren't useful - the "hype" they mean is overestimating their capabilities. An LLM may be able to pass a "theory of mind" test, or it may fail spectacularly, depending on how you prompt it. And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future, but we're not there yet, and (AFAIK) nobody can tell how long it will take to get there.
> And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future [...]

I don't think so. When you say "it's not capable of actually reasoning", that's because it's an LLM; and if it "changes in the future", that's because the new system must no longer be a pure LLM. The appearance of reasoning in LLMs is an illusion.

How is the illusion of reasoning different from say “actual reasoning?”
Because it literally can't reason, and it also has no innate agency. Even the most dedicated creators of LLM-based AI technology have clearly and repeatedly stated that these are very sophisticated stochastic parrots with no sense of self. How much easier could it be to see that LLMs like GPT aren't actual thinking machines in the way we humans are?

Yes, many people reason based on pure pattern-matching and repeat opinions not because they've reasoned them but because they're what they've absorbed from other sources, but even the world's most unreasoned human being with at least functional cognition still uses an enormous amount of constant, daily, hourly self-directed decision-making for a vast variety of complex and simple, often completely spontaneous scenarios and tasks in ways that no machine we've yet built on Earth does or could.

Moreover, even when some humans say or "believe" things based on nothing more than what they've absorbed from others without really considering it in depth, they almost always do so in a particularly selective way that fits their cognitive, emotional and personal predispositions. This very selectiveness is a distinctly conscious trait of a self-aware being. It's something LLMs don't have as far as I've yet seen.

In the same way that illusions of anything else differ from the real thing. A wax apple is different from a real apple, even if it's hard to tell them apart sometimes. You may require further investigation to differentiate them (e.g., cutting open the apple or asking the AI to solve tricky reasoning questions), but if you can find a difference, there is a difference.
I have a hunch I am misunderstanding your argument, but does that mean the only way to build a "true reasoning machine" would be to just create a human.

I guess what I'm really asking, what would you expect to observe to make it not illusory?

To distinguish between "is an illusion" and "is not an illusion", you need evidence that isn't observational. The whole point of illusions is that observational evidence is unreliable.

A desert mirage in the distance is an illusion; to the observer, it's indistinguishable from an oasis. You can only tell that it's a mirage by investigating how the appearance was created (e.g. by dragging your thirsty ass through the sand, to the place where the oasis appeared to be).

If one has a reasonable understanding of two concepts that make up a larger system, and that system has little else in addition to those concepts, one is able to come up with that system by oneself, even though one has never seen it and its composition was never explained prior to that logical process.

The illusion happens when, clearly, the alleged reasoning behind how such a system comes to be is based on prior knowledge of the system as a whole. Meaning, its construction/source was within the training data.

I'm really curious, are you able to demonstrate reasoning, not reasoning and the illusion of reasoning in a toy example? I'd like to see what each looks like.
Have you met someone who is full of bullshit? They sound REALLY convincing, except if you know anything about the subject, their statements are just word salad?
Bullshit has an illusion of reasoning instead of actual reasoning. Basically you give arguments that sound reasonable on the surface but there is no actual reasoning behind them.
> Bullshit has an illusion of reasoning instead of actual reasoning.

Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning? You could argue that bullshit is fallacious reasoning, "pseudo-reasoning" based on incorrect rules of inference.

But these models don't use any rules of inference; they produce output that resembles the result of reasoning, but without reasoning. They are trained on text samples that presumably usually are the result of human reasoning. If you trained them on bullshit, they'd produce output that resembled fallacious reasoning.

No, I don't think the touchstone for actual reasoning is a human mind. There are machines that do authentic reasoning (e.g. expert systems), but LLMs are not such machines.

> Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning?

None in principle, at least if you take the common definition of bullshit as saying things for effect, without caring whether they're true or false.

Fallacious reasoning will make you wrong. No reasoning will make you spew nonsense. Truth and lies and bullshit, all require reasoning for the structure of what you're saying to make sense, otherwise it devolves to nonsense.

> But these models don't use any rules of inference

Neither do we. Rules of inference came from observation. Formal reasoning is a tool we can employ to do better, but it's not what we naturally do.

That's not the case. It's very much in the realm of "we don't know what's going on in the network."

Rather than a binary it's much more likely that it's a mix of factors going into results that includes basic reasoning capabilities developed from the training data (much like board representations and state tracking abilities developed feeding board game moves into a toy model in Othello-GPT) as well as statistic driven autocomplete.

In fact often when I've seen GPT-4 get hung up with logic puzzle variations such as transparency, it tends to seem more like the latter is overriding the former, and changing up tokens to emoji representations or having it always repeat adjectives attached to nouns so it preserves variation context gets it over the hump to reproducible solutions (as would be expected from a network capable of reasoning) but by default it falls into the pattern of the normative cases.

For something as complex as SotA neural networks, binary sweeping statements seem rather unlikely to actually be representative...

As a PhD student in NLP who's graduating soon, my perspective is that language models do not demonstrate "reasoning" in the way most people colloquially use the term.

These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.

Unfortunately, this means the "reasoning" exhibited by language models is limited: if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.

That said, I do think adding reasoning capabilities is an active area of research, but we don't have a clear time horizon on when that might happen. Current prompting approaches are stopgaps until research identifies a promising approach for developing reasoning, e.g. combining latent space representations with planning algorithms over knowledge bases, constraining the logits based on an external knowledge verifier, etc (these are just random ideas, not saying they are what people are working on, rather are examples of possible approaches to the problem).

In my opinion, language models have been good enough since the GPT-2 era, but have been held back by a lack of reasoning and efficient memory. Making the language models larger and trained on more data helps make them more useful by incorporating more facts with increased computational capacity, but the approach is fundamentally a dead end for higher level reasoning capability.

Congrats on the upcoming PhD!

I'm curious where you are drawing your definition or scope for 'reasoning' from?

For example, in Shuren, The Neurology of Reasoning (2002), the definition selected was "the ability to draw conclusions from given information."

While I agree that LLMs can only process token to token and that juggling context is critical to effective operation such that CoT or ToT approaches are necessary to maximize the ability to synthesize conclusions, I'm not quite sure what the definition of reasoning you have in mind is such that these capabilities fall outside of it.

The typical lay audience suggestion that LLMs cannot generate new information or perspectives outside of the training data isn't the case, as I'm sure you're aware, and synthesizing new or original conclusions from input is very much within their capabilities.

Yes, this has to happen within a context window and occurs on a token by token basis, but that seems like a somewhat arbitrary distinction. Humans are unquestionably better at memory access and running multiple subprocesses on information than an LLM.

But if anything, this simply suggests that continuing to move in the direction of multiple pass processing of NLP tasks with selective contexts and a variety of fine tuned specializations of intermediate processing is where practical short term gains might lie.

As for the issue of new domains outside of training data, I'm somewhat surprised by your perspective. Hasn't one of the big research trends over the past twelve months been that in-context learning has proven more capable than was previously expected? I'd agree that a zero-shot evaluation of a problem type that isn't represented in an LLM's training data is setting it up for failure, but the capacity to extend in-context examples outside of training data has proven relatively more successful, no?

> These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.

Is it not possible that this is essentially how our brains do it too? Attempt to plan by branching out to related ideas until they contain an answer. Any of these statements that AI can't be on track to reason like a human because of X seem to come with an implication that we have such a good model of the human brain that we know it doesn't X. But I'm not an expert on neuroscience so in many of these cases maybe that implication is true.

>Is it not possible that this is essentially how our brains do it too?

Is that how you think? Just curious

I think the word "essentially" is important here. I don't think we can observe how we think. How it appears in consciousness is not necessarily real - it might be just a model constructed ex-post.

I do not know that much about AI but I know at least something about cognitive psychology and it seems to me that a lot of claims about LLMs "not actually reasoning" and similar are probably made by CS graduates who have unreflected assumptions about how human thinking works.

I don't claim to know how human thinking works but if there is one thing I would conclude from studying psychology and knowing at least some basics about neuroscience, it would be that "it's not how it appears to us".

Nobody knows how human reasoning actually works but if I had to guess (based on my amateurish mental model of the functioning of the human brain), I would say that it is probably a lot closer to LLMs and a lot less rational than is commonly assumed in discussions like this one.

I don’t think we are conscious about how the language center correlates with our memories and then predicts the strings of words coming out.
> if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.

True. But look at the Phi-1.5 model - it punches 5x above its weight limit. The trick is in the dataset:

> Our training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity. We point out that the only non-synthetic part in our training data for phi-1.5 consists of the 6B tokens of filtered code dataset used in phi-1’s training (see [GZA+ 23]).

> We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.

https://arxiv.org/pdf/2309.05463.pdf

Synthetic data has its advantages - less bias, more diverse, scalable, higher average quality. But more importantly, it can cover all the permutations and combinations of skills, concepts, situations. That's why a small model just 1.5B like Phi was able to work like a 7B model. Usually at that scale they are not coherent.

Are you going to school in Langley, Virginia?
NSA is more commonly associated with Fort Meade, MD, for what that's worth.
> These models have no capacity to plan ahead

How would you describe the behavior of "GPT Advanced Data Analysis"?

> it's not capable of actually reasoning

Define reasoning. Because in my definition GPT 4 could reason without doubt. It definitely can't reason better than experts in the field, but it could reason better than say interns.

I don't have access to GPT 4 but I'd be interested to see how it does on a question like this:

"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."

... because on GPT 3.5 the answer begins like the below and then gets worse:

"Let's break down the process step by step:

Initially, you have 50 red balls and 50 blue balls in the container.

1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."

If I was hiring interns this dumb, I'd be in trouble.

EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.

This is such a flawed puzzle. And GPT 4 answers it rightly. It is a long answer but the last sentence is "This is one possible scenario. However, there could be other scenarios based on the order in which balls are drawn. But in any case, the same logic can be applied to find the number of each color of ball left in the container."
The ability to identify that there isn't a simple closed form result is actually a key component of reasoning. Can you stick the answer it gives on a gist or something? The GPT 3.5 response is pure, self-contradictory word salad and of course delivered in a highly confident tone.
> The ability to identify that there isn't a simple closed form result is actually a key component of reasoning.

If that's the case, then most humans alive would fail to meet this threshold. Finding a general solution to a specific problem, and identifying whether or not there exists a closed-form solution, and even knowing these terms, are skills you're taught in higher education, and even the people who went through it are prone to forget all this unless they're applying those skills regularly in their life, which is a function of specific occupations.

https://pastebin.com/r9bNi8GD

GPT 4 goes into detail about one example scenario, which most humans won't do, but it is a technically correct answer, as it said it depends on the order.

I asked GPT4 and it gave a similar response. So then I asked my wife and she said, "do you want more white balls at the end or not?" And I realized that as a CS or math question we assume that the draw is random. Other people assume that you're picking which ball to draw.

So I clarified to ChatGPT that the drawing is random. And it replied: "The exact numbers can vary based on the randomness and can be precisely modeled with a simulation or detailed probabilistic analysis."

I asked for a detailed probabilistic analysis and it gives a very simplified analysis. And then basically says that a Monte Carlo approach would be easier. That actually sounds more like most people I know than most people I know. :-)
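For readers who want to try the Monte Carlo route themselves, here is a minimal sketch in Python. It assumes one particular reading of the puzzle that the original wording does not spell out: each draw is uniformly random over the balls currently in the container, drawn balls are not returned, and every blue draw adds two white balls. The names `simulate_once` and `estimate` are made up for this illustration.

```python
import random


def simulate_once(rng, red=50, blue=50, draws=100):
    """Play the puzzle once and return the (red, blue, white) counts left over."""
    white = 0
    for _ in range(draws):
        total = red + blue + white
        pick = rng.randrange(total)  # uniform choice among the balls still in the container
        if pick < red:
            red -= 1
        elif pick < red + blue:
            blue -= 1
            white += 2  # assumed rule: each blue draw adds two white balls
        else:
            white -= 1
    return red, blue, white


def estimate(trials=20000, seed=0):
    """Average the leftover counts over many independent runs."""
    rng = random.Random(seed)
    sums = [0, 0, 0]
    for _ in range(trials):
        for i, count in enumerate(simulate_once(rng)):
            sums[i] += count
    return tuple(s / trials for s in sums)


if __name__ == "__main__":
    red, blue, white = estimate()
    print(f"average left: red={red:.2f} blue={blue:.2f} white={white:.2f}")
```

Under these assumptions the container always ends with exactly twice as many balls as blue balls drawn, which is a handy sanity check on any answer, human or GPT.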

I don't understand the question. Surely the answer depends which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected value of each number of ball at the end of the process?

Seems like quite a difficult question to compute exactly.

I reworded the question to make it clearer and then it was able to simulate a bunch of scenarios as a monte carlo simulation. Was your hope to calculate it exactly with dynamic programming? GPT-4 was not able to do this, but I suspect neither could a lot of your interns.
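The dynamic programming calculation is less daunting than it sounds. The sketch below is one possible way to do it, not what anyone in this thread ran: it assumes uniformly random draws without returning the drawn ball, and it tracks the probability of every reachable (red, blue, white) state one draw at a time. Results are exact up to floating-point rounding. `exact_distribution` and `expected_counts` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def exact_distribution(red=50, blue=50, draws=100):
    """Return {(red, blue, white): probability} after the given number of draws."""
    dist = {(red, blue, 0): 1.0}
    for _ in range(draws):
        nxt = defaultdict(float)
        for (r, b, w), p in dist.items():
            total = r + b + w
            if total == 0:
                raise ValueError("container ran empty before all draws were made")
            if r:
                nxt[(r - 1, b, w)] += p * r / total      # a red ball is drawn
            if b:
                nxt[(r, b - 1, w + 2)] += p * b / total  # blue drawn, two white added (assumed rule)
            if w:
                nxt[(r, b, w - 1)] += p * w / total      # a white ball is drawn
        dist = dict(nxt)
    return dist


def expected_counts(dist):
    """Expected number of red, blue and white balls left in the container."""
    return tuple(sum(p * state[i] for state, p in dist.items()) for i in range(3))


if __name__ == "__main__":
    print(expected_counts(exact_distribution()))
```

Because the number of draws so far is determined by the state, there are at most 51 x 51 live states per step, so the whole table is tiny.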

>I don't understand the question. Surely the answer depends which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected value of each number of ball at the end of the process?

These are very good questions that anyone with the ability to reason would ask if given this problem.

"You're holding it wrong."

You're asking GPT to do maths in its head, the AI equivalent of a person standing in the middle of the room with no tools and getting grilled in an oral examination of their knowledge.

Instead, collaborate with it, while giving it the appropriate tools to help you.

I asked it to write a Monte Carlo simulation of the problem in Wolfram Mathematica script. It did this about 10-100x faster than I would have been able to. It made a few small mistakes with the final visualisation, but I managed to get it to output a volumetric plot showing the 3D scatter plot of the histogram of possible outcomes.

I even got it to save a video of the plot rotating: https://streamable.com/2aphbz

AI can reason! Just not reasonably!
It can reason better than most humans put into the same situation.

This problem doesn't result in a constant value, it results in a 3D probability distribution! Very, very few humans could work that out without tools. (I'm including pencil and paper in "tools" here.)

With only a tiny bit of coaxing, GPT 4 produced an animated video of the solution!

Try to guess what fraction of the general population could do that at all. Also try to estimate what fraction of general software developers could solve it in under an hour.

This is what I got on a basically brand new OpenAI account: https://chat.openai.com/share/5199c972-478d-406f-9092-061a6b...

All told, I'd say it's a decent answer.

Edit: I took it to completion: https://chat.openai.com/c/6cdd92f1-487a-4e1c-ab94-f2bdbf282d...

These were the first responses each time, with no massaging/retries/leading answers. I will say it's not entirely there. I re-ran the initial question a few times afterwards and one was basically gibberish.

GPT-4:

e composition of what's left in the container.

There's a couple of scenarios, which depend on when you run out of blue balls:

1) If you draw all 50 blue balls first, for every blue ball you draw, you are adding 2 white balls. Hence, once you've drawn all the blue balls, you have replaced them with 100 white balls. This means you've drawn 50 balls total and there's now 100 white balls and 50 red balls (150 balls) in the container. Then, you continue and draw the next 50 balls. These could be any combination of white and red balls, we don't know which color is preferred when you continue drawing.

2a) If you draw all red balls in these next 50 draws, you would have 50 red balls out and 50 blue balls out. So, you would end with 100 white balls in the container.

2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.

2c) If you draw a mixture of red and white balls in these next 50 draws, the number leftover in the container would vary according to the specific combination drawn.

Remember, the order in which the balls are drawn matters to this question since we have a rule in place that changes the overall total (or population) of balls within the container. In each of these scenarios, you've drawn 50 times initially (all blue) and then 50 times again (in some unknown combination of red and white). You've drawn 100 times total and changed the number of white balls in the container from zero to an amount dependent on how many times you drew a white ball on your second round of 50 draws.

Yeah, that's still pretty much nonsense isn't it?

2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.

... so after removing 100 balls, I've removed 150 balls? And the 150 balls that I've removed are red, white and blue despite the fact that I removed 50 blue balls initially and then 50 white ones.
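The bookkeeping in these scenarios is easy to check mechanically. As a small sketch, assuming the rule "remove the drawn ball, and add two white balls whenever the drawn ball is blue", the hypothetical helper `apply_draws` below replays a fixed draw order and reports what is left in the container.

```python
def apply_draws(sequence, red=50, blue=50, white=0):
    """Replay a fixed draw order and return the counts left in the container."""
    counts = {"red": red, "blue": blue, "white": white}
    for color in sequence:
        if counts[color] == 0:
            raise ValueError(f"no {color} ball left to draw")
        counts[color] -= 1
        if color == "blue":
            counts["white"] += 2  # assumed rule: each blue draw adds two white balls
    return counts


# Scenario 2b: all 50 blue balls first, then 50 white balls (100 draws in total).
print(apply_draws(["blue"] * 50 + ["white"] * 50))
# → {'red': 50, 'blue': 0, 'white': 50}
```

So the final container state quoted in 2b is consistent with 100 draws; what does not add up is the list of balls said to be "out", since in that scenario no red ball is ever drawn.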

Just because it fails one test in a particular way doesn’t mean it lacks reasoning entirely. It clearly does have reasoning based on all the benchmarks it passes.

You are really trying to make it not have reasoning for your own benefit

GPT 3.5 is VERY dumb when compared to GPT 4. Like, the difference is massive.
GPT 4 still does a lot of dumb stuff on this question; you see several people post outright wrong answers and say "Look how gpt-4 solved it!". That happens quite a lot in these discussions, so it seems like the magic to get gpt-4 to work is that you just don't check its answers properly.
It's still a tool after all.

I've had to work with imperfect machines a lot in my recent past. Just because sometimes it breaks, doesn't mean it's useless. But you do have to keep your eyes on the ball!

I ran this through GPT-4 Advanced Data Analytics version: https://chat.openai.com/share/b84feb03-22ed-4231-be41-cdb725...

Seems like it reasons its way to this answer at the end to me: Mind you, while averages are insightful, they don't capture the delightful unpredictability of each individual run. Would you like to explore this delightful chaos further, or shall we move on to other intellectual pursuits?

https://chat.openai.com/share/a9806bd1-e5a9-4fea-981b-2843e6...

Took a bit of massaging and I enabled the Data Analysis plugin which lets it write python code and run it. It looks like the simulation code is correct though.

>Let's assume you draw x blue balls in 100 draws. Then you would have drawn 100−x red balls.

Uhm.

I came at it from a different angle. The simulation code in my case had a bug which I needed to point out. Then it got a similar final answer.
It's not reasoning. It's word prediction. At least at the individual model level. OpenAI is likely using a collection of models.
ChatGPT is trained on text that includes most reasoning problems that people come up with.

You see reasoning issues when you use more real world examples, rather than theoretical tests.

I had 4 failure states.

1) Summarization: It summarized 3 transcripts correctly, for the fourth it described the speaker as a successful VC. The speaker was a professor.

2) It was to act as a classifier, with a short list of labels. Depending on the length of text, the classifier would swap over to text gen. Other issues included novel labels, new variations of labels, and so on.

3) Agents - This died on the vine. Leave having to learn asynch, vector DBs or whatever. You can never trust the output of an LLM, so you can never chain agents.

4) I focused on using ChatGPT to complete a project. I hadn't touched HTML ever - the goal was to use ChatGPT to build the site. This would cover design, content, structure, development, hosting, and improvements.

I still have trauma. Wrong code, bad design, were base issues. If code was correct, it simply meant I had dug a deeper grave. I had anticipated 70% of the work being handled by ChatGPT, it ended up at 30% at the most.

ChatGPT is great IF you already are a subject expert - you can brush over the issues and move on.

"Hallucinations" is the little bit of string that you pull on, and the rest unravels. There are no hallucinations, only humans can hallucinate - because we have an actual ground truth to work with.

LLMs are only creating the next token. For them to reason, they must be holding structures and proxies in some data store, and actively altering it.

It's easier to see once you deal with hallucinations.

What is your definition?
If it can solve basic logic problems, then it could reason. And if it could write code of a new game with new logic, then it could reason for sure.

Example of basic problem: In a shop, there are 4 dolls of different heights P,Q,R and S. S is neither as tall as P nor as short as R. Q is shorter than S but taller than R. If Kittu wants to purchase the tallest doll, which one should she purchase? Think step by step.
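For reference, the doll puzzle has a single consistent answer, which a brute-force check confirms. This is only a sketch for verifying the expected result, reading "neither as tall as P nor as short as R" as "shorter than P and taller than R"; the function name `consistent_orderings` is invented for the example.

```python
from itertools import permutations


def consistent_orderings():
    """List every shortest-to-tallest ordering of the dolls that fits the clues."""
    results = []
    for order in permutations("PQRS"):
        height = {doll: rank for rank, doll in enumerate(order)}
        s_between_r_and_p = height["R"] < height["S"] < height["P"]
        q_between_r_and_s = height["R"] < height["Q"] < height["S"]
        if s_between_r_and_p and q_between_r_and_s:
            results.append(order)
    return results


print(consistent_orderings())
# → [('R', 'Q', 'S', 'P')]
```

The only ordering that survives is R < Q < S < P, so Kittu should buy doll P.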

Really?
> And that's because, despite all of its training data, it's not capable of actually reasoning.

Your conclusion doesn't follow from your premise.

None of these models are trained to do their best on any kind of test. They're just trained to predict the next word. The fact that they do well at all on tests they haven't seen is miraculous, and demonstrates something very akin to reasoning. Imagine how they might do if you actually trained them or something like them to do well on tests, using something like RL.

> None of these models are trained to do their best on any kind of test

How do you know GPT-4 wasn't trained to do well on these tests? They didn't disclose what they did for it, so you can't say it wasn't trained to do well on these tests. That could be the magic sauce for it.

They are trained to predict next tokens in a stream.

That is the learning algorithm.

The algorithm they learn, in response, is quite different. Since that learned algorithm is based on the training data.

In this case the models learn to sensibly continue text or conversations. And they are doing it so well it’s clear they have learned to “reason” at an astonishing level.

Sometimes, not as good as a human.

But in a tremendous number of ways they are better.

Try writing an essay about the many-worlds interpretation of the quantum field equation, from the perspective of Schrödinger, with references to his personal experiences, using analogies with medical situations, formatted as a brief for the Supreme Court, in Dr. Seuss prose, in a random human language of choice.

In real time.

While these models have some trouble with long chains of reasoning, and reasoning about things they don’t have experience with (different modalities, although sometimes they are surprisingly good), it is clear that they can also reason by combining complex information drawn from their whole knowledge base much faster and more sensibly than any human has ever come close to.

Where they exceed us, they trounce us.

And where they don’t, it’s amazing how fast they are improving. Especially given that year to year, biological human capabilities are at a relative standstill.

——

EDIT: I just tried the above test. The result was wonderful whimsical prose and references, that made sense at a very basic level, that a Supreme Court of 8 year olds would likely enjoy, especially if served along with some Dr. Seuss art! In about 10-15 seconds.

Viewed as a solution to an extremely complex constraint problem, that is simply amazing. And far beyond human capabilities on this dimension.

You are right that the process involves predicting words from training data. But you can still make training data focused on passing these tests. Adding millions of test questions to all of these to optimize for answering test questions is perfectly doable when you have the resources OpenAI has.

A strong hint to what they focused on in their training process is what metrics they used in their marketing of the model. You should always bet on models being optimized to perform on whatever metrics they themselves give you when they market the model. Look at the gpt-4 announcement, what metrics did they market? So what metrics should we expect they optimized the model for?

Exam results are the first metric they mention, so exams were probably one of their top priorities when they trained gpt-4.

https://openai.com/research/gpt-4

Yes, absolutely. They can adjust performance priorities.

By the relative mix of training data, additional fine tuning training phases, and/or pre-prompts that give the model extra guidance relative to particular task types.

>The fact that they do well at all on tests they haven't seen

Haven't they seen these tests?

We know little to nothing of how these models get trained.

LLMs are trained to predict text, and one of the results of this is the LLM has as many "faces" as exist in the training data, so it's going to be _very_ different depending on the prompt. It's not a consistent entity like a human. RLHF is an attempt to mediate this, but it doesn't work perfectly.
I’m often confused over claims on the reasoning capabilities. It is often mentioned in debates as a clear and undeniable issue with current LLM’s. So since this claim can be made, where are said tests about reasoning skills that GPT-4 fails?

If it’s a debate on the illusion of reasoning, I’d be careful how I step here, because it’s been found these things probably work so well because the human brain is also a biological real-time prediction machine and “just” guessing too: https://www.scientificamerican.com/article/the-brain-guesses...

Isn't that the same as for humans? If you are speaking with me (prompting), my answers will be different, based on how you prompted me for an answer.
> I was mesmerized by GPT-3's ability to understand concepts

This language embodies the anthropomorphic assumptions that the author is attacking.

Or the corollary: that there's really no such thing as anthropomorphic. There's inputs and outputs, and an observer's opinion on how well the outputs relate to the inputs. Thing producing the outputs, and the observer, can be human or not human. Same difference.
It absolutely is anthropomorphizing to claim "GPT-3's ability to understand concepts" rather than simply calling it "reproduce, mix and match text from an enormous corpus". And we can totally legitimately compare to a jury of human observers' opinions on how well(/badly) the output generated relates to the inputs.

For the specific example the OP cited "War: like being eaten by a dragon and then having it spit you out"

then unless its answer to "Where were you in between being eaten by a dragon and before it spat you out?" is "in the dragon's digestive system" that isn't understanding.

And I'm curious to see it answer "Dragons only exist in mythology; does your analogy mean war doesn't exist either? Why not compare to an animal that exists?"

> "War: like being eaten by a dragon and then having it spit you out"

This exact text, and the response (several attempts) is flagged and censored, with ChatGPT-4 web interface. :-|

Anyone who has used GPT-4 for any amount of time knows it is definitely not "reproduce, mix and match text from an enormous corpus". Also what's the point of the question. GPT has been specifically trained to not answer this. But in any case here is the answer with instruct model which generally is more open ended:

Sure, I can compare it to a real animal such as bears. Just like dragons are depicted as fearsome and destructive creatures in mythology, bears are also portrayed as powerful and dangerous animals in many cultures. However, just as the existence of dragons is a matter of fiction, the concept of a "war on bears" would also be nonsensical.

In reality, the issue of war is complex and multifaceted, involving political, economic, and societal factors. It cannot be reduced to a simplistic battle between humans and a particular animal species.

Furthermore, the use of animals in analogies can be problematic as it may trivialize the gravity and complexity of real-world issues. Wars and conflicts have serious consequences for human lives, whereas animals do not possess the same agency or ability to wage war.

In conclusion, while dragons may serve as a metaphor for war in mythology, comparing war to a fictional creature diminishes the seriousness of the topic and fails to capture the complexities of real conflicts.

>Also what's the point of the question. GPT has been specifically trained to not answer this. But in any case here is the answer with instruct model which generally is more open ended:

It would demonstrate basic reasoning skills that weren't things one would "reproduce, mix and match text from an enormous corpus". Like the response you provided, which is meaningless word salad. It's a prima facie takedown of your post.

This is like people who hate poetry, insisting their bad poetry is good poetry. Why? Because who else is to say otherwise! Well, the good poets. The people that appreciate poetry will know the difference. Everyone else won't care, save for those invested in having to sell their bad poetry as good.

What has poetry to do with reasoning? You should think of GPT as a terse person who refuses this kind of thing. Certainly there are people like that who have good reasoning skills but can't answer your question in a poetic way (I being one).
Can AI people stop with the defense of "what if thing really is not a thing?" "what if thing is really what humans do?" These aren't answers to questions. It's deflecting nonsense posed as philosophical thought.
This.

We are in a Cambrian Explosion on the software side and hardware hasn’t yet reacted to it. There’s a few years of mad discovery in front of us.

People have different impressions as to the shape of the curve that’s going up and to the right, but only a fool would not stop and carefully take in what is happening.

Exactly and things are actually getting crazy now. Pardon the tangent but for some reason this hasn't reached the frontpage on HN yet: https://github.com/OpenBMB/ChatDev

Making your own "internal family system" of AIs is making this exponential (and frightening), like an ensemble on top of the ensemble, with specific "mindsets", that with shared memory can build and do stuff continuously. Found this from a comp sci professor on Tiktok so be warned: https://www.tiktok.com/@lizthedeveloper/video/72835773820264...

I remember a couple of comments here on HN when the hype began about how some dude thought he had figured out how to actually make an AGI - can't find it now, but it was something about having multiple AIs with different personalities discoursing with a shared memory - and now it seems to be happening.

This coupled with access to linux containers that can be spawned on demand, we are in for a wild ride!

I saw chatdev on hn and have been pretty disappointed with it :(

Haven’t had it make anything usable that’s more complicated than a mad lib yet

> If the speed of improvement stays anywhere near the level it was the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.

That's a big assumption to make. You can't assume that the rate of improvement will stay the same, especially over a period of 2 decades, which is a very long time. Every advance in technology hits diminishing returns at some point.

Why do you think so?

Technological progress seems to be accelerating rather than diminishing to me.

Computers are a great example: They have been getting more capable exponentially over the last decades.

In terms of performance (memory, speed, bandwidth) and in terms of impact. First we had calculators, then we had desktop applications, then the Internet and now we have AI.

And AI will help us get to the next stage even faster.

I’m not putting my coins on these advances.

More likely this will become the new “search” technology and get polluted with ads. People will lose trust and it will decay.

That is certainly where the economic incentives appear to be.
A lot of the progress in the last 3-4 years was predictable from GPT-2 and especially GPT-3 onwards - combining instruction following and reinforcement learning with scaling GPT. With research being more closed, this isn't so true anymore. The mp3 case was predictable in 2020 - some early twitter GIFs showed vaguely similar stuff. Can you predict what will happen in 2026/7 though, with multimodal tech?

I simply don't see it as being the same today. The obvious element of scaling, or techniques that imply a useful overlap, isn't there. Before, researchers brought together excellent and groundbreaking performance on different benchmarks and areas as they worked on GPT-3; since 2020, apart from instruction following, less has been predictable.

Multimodal could change everything (things like the ScienceQA paper suggest so), but it also might not shift benchmarks. It's just not so clear that the future is as predictable as, or will move faster than, the last few years. I do have my own beliefs, similar to Yann LeCun's, about what architecture or rather infrastructure makes most sense intuitively going forward, and there's not really the openness we used to have from top labs to know whether they are going these ways or not. So you are absolutely right that it's hard to imagine where we will be in 20 years, but, strangely, because it is much less clear than it was in 2020 where we will be even 3 years from now, I would say progress is much less guaranteed than many feel it is...

I was also thinking about how quickly AI may progress and am curious about your or other people's thoughts. When estimating AI progress, estimating orders of magnitude sounds like the most plausible way to do it, just like Moore's law has guessed the magnitude correctly for years. For AI, it is known that performance increases linearly when the model size increases exponentially. Funding currently increases exponentially, meaning that performance will increase linearly. So, AI performance will keep increasing linearly as long as funding keeps increasing exponentially. On top of this, algorithms may be made more efficient, which may occasionally yield an order-of-magnitude improvement. Does this reasoning make sense? I think it does but I could be completely wrong.
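
To make that reasoning concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the log-linear form of `performance`, the starting size of 1e9, and the 10x-per-year growth in `model_size_at` are made-up numbers, not measured scaling laws or real budgets. It only shows that if size tracks exponentially growing funding and performance tracks the logarithm of size, the yearly gain comes out constant.

```python
import math


def performance(model_size: float, base: float = 0.0, gain_per_decade: float = 1.0) -> float:
    """Toy log-linear scaling (assumed form): each 10x in size adds a fixed gain."""
    return base + gain_per_decade * math.log10(model_size)


def model_size_at(year: int, start_size: float = 1e9, growth_per_year: float = 10.0) -> float:
    """Hypothetical: model size grows by a constant factor per year, tracking funding."""
    return start_size * growth_per_year ** year


# Performance for five consecutive years, and the year-over-year differences.
scores = [performance(model_size_at(year)) for year in range(5)]
steps = [later - earlier for earlier, later in zip(scores, scores[1:])]

print("scores:", scores)
print("yearly gains:", steps)
```

Under these assumptions every entry in `steps` is the same, which is the "exponential input, linear output" claim. If funding growth slows to anything less than exponential, the gains shrink each year.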
You can check my post history to see how unpopular this point of view is, but the big "reveal" that will come up is as follows:

The way that LLMs and humans "think" is inherently different. Giving an LLM a test designed for humans is akin to giving a camera a 'drawing test.'

A camera can make a better narrow final output than a human, but it cannot do the subordinate tasks that a human illustrator could, like changing shadings, line width, etc.

An LLM can answer really well on tests, but it often fails at subordinate tasks like 'applying symbolic reasoning to unfamiliar situations.'

Eventually the thinking styles may converge in a way that makes the LLMs practically more capable than humans on those subordinate tasks, but we are not there yet.

Most of the improvements apparently come from training larger models with more data. Which is part of the problem mentioned in the article - the probability that the model just memorizes the answers to the tests is greatly increased.

AI is getting subjectively better, and we need better tests to figure out if this improvement is objectively significant or not.

> Most of the improvements apparently come from training larger models with more data.

OpenAI is reportedly losing 4 cents per query. With a thousandfold increase in model size, and assuming linear scale in cost, that's a problem. Training time is going to go up too. Moore's law isn't going to help any more. Algorithmic improvements may help...if any significant ones can be found.
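
A back-of-the-envelope version of that arithmetic, as a Python sketch. The 4 cents and the thousandfold factor come from the comment above; treating the reported loss as a stand-in for per-query compute cost, and the `exponent` parameter for non-linear scaling, are assumptions added purely for illustration.

```python
def scaled_cost(cost_per_query_usd: float, size_multiplier: float, exponent: float = 1.0) -> float:
    """Scale a per-query cost with model size; exponent=1.0 means linear scaling (assumed)."""
    return cost_per_query_usd * size_multiplier ** exponent


reported_loss_usd = 0.04   # reported figure, used here as a proxy for cost per query
size_multiplier = 1000     # the thousandfold increase in model size

linear = scaled_cost(reported_loss_usd, size_multiplier)
sublinear = scaled_cost(reported_loss_usd, size_multiplier, exponent=0.5)  # hypothetical efficiency gains

print(f"linear scaling:    ${linear:.2f} per query")
print(f"sublinear scaling: ${sublinear:.2f} per query")
```

With linear scaling the per-query figure lands in the tens of dollars, which is the problem being pointed at. The sublinear line only shows how much algorithmic improvements would have to bend the curve.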

That’s backwards.

Training a model on more data improves generalization not memorization.

To store more information in the same number of parameters requires the commonality between examples to be encoded.

In contrast, the less data trained on, especially if repeated, lets the network learn to provide good answers for that limited set without generalizing. I.e. memorizing.

——

It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.

The fewer examples, the more likely they just pattern match.

> It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.

> The fewer examples, the more likely they just pattern match.

A kid who uses a calculator and just fills in the answer to every question will see a lot more examples than a kid that learned by starting from simple concepts and understanding each step. But the kid who focused on learning concepts and saw way fewer problems will obviously have a better understanding here.

So no, you are clearly wrong here, humans don't learn that way at all. These models learn that way, you are right on that, but humans don't.

I have no idea where your calculator came from.

In neither case did I introduce one.

And since the calculator itself already has a general understanding, it would seem completely counterproductive to start training a computer or child by first giving them a machine that has already solved the problem.

Also, for what it’s worth, I am speaking from many years experience not just training models but creating the algorithms that train them.

Replace "uses calculator" with "looks through solved problems", same thing. Not sure what you don't understand. Humans don't build understanding by seeing a lot of solved examples.

To make a human understand we need to explain how things work to them. You don't just show examples. A human who is just shown a lot of examples won't understand much at all, even if he tries to replicate them.

> Also, for what it’s worth, I am speaking from many years experience not just training models but creating the algorithms that train them.

What does this have to do with how humans learn?

Humans learn vast amounts of information from examples.

They learn their first words, how to walk, what a cat looks like from many perspectives, how to parse a visual scene, how to parse the spoken word, interpret facial expressions and body language, how different objects move, how different creatures behave, different materials feel, what things cause pain, what things taste like and how they make them feel, how to get what they want, how to climb, how not to fall, all by trial & example. On and on.

And yes, as we get older we get better and better at learning 2nd hand from others verbally, and when people have the time to show us something, or with tools other people already invented.

Like how a post-trained model picks up on something when we explain it via a prompt.

But that is not the kind of training being done by models at this stage. And yet they are learning concepts (pre-prompt) that, as you point out, you & I had to have explained to us.

I pretty much want the LLM to be great at memorizing things. That's what I'm not great at.

If it had perfect recall I would be so thrilled.

And just because it's memorized the data--as all intelligences would need to do to spit data out--doesn't mean it can't still do useful operations on the data, or explain it in different words, or whatever a human might do with it.

Do we? I use gpt-4 daily and it matters not to me what the source of the "intelligence" is. It's subjective what "intelligence" even means. It's subjective how the brain works. Almost by definition AI is "things that can't be objectively measured".
What's the benefit of doing this vs copying one of the many (far superior) Javascript mp3 players on the internet, such as here?

https://freefrontend.com/javascript-music-players/

It'd be a bit faster to get up and running with ChatGPT. In the AI, you'd have to phrase the instruction & copy the output into a file. For search, you have to do both those things and learn a UI that wasn't built to taste.
Almost nothing happened in AI for about 50 years. That's the norm in the field.
I got curious and did this myself. Needed a bit of nudging to get where I wanted, but I even had it make an Electron wrapper:

https://chat.openai.com/share/29d695e6-7f23-4f03-b2be-29b7c9...

This is awesome, thanks for sharing.

Do you (or anyone) know of any products that allow for iterating on the generated output through further chatting with the ai? What I mean, is that each subsequent prompt here either generated a new whole output, or new chunks to add to the output. Ideally, whether generating code or prose, I’d want to keep prompting about the generated output and the AI further modifies the existing output until it’s refined to the degree I want it.

Or is that effectively what Copilot/cursor do and I’m just a bad operator?

> Do you (or anyone) know of any products that allow for iterating on the generated output through further chatting with the ai? What I mean, is that each subsequent prompt here either generated a new whole output, or new chunks to add to the output. Ideally, whether generating code or prose, I’d want to keep prompting about the generated output and the AI further modifies the existing output until it’s refined to the degree I want it.

ChatGPT does this.

No problem, it was a fun morning exercise for me :)

Copilot, at least from what little I did in vscode, isn't as powerful as this. I think there's a GPT4 mode for it that I haven't played with that'd be a lot closer to this.

I used gpt4 to write a script that I can ssh from my iPhone to a m1 that downloads the mp3 from a yt url on my iPhone clipboard. The only thing I am missing is automating the sync button when the iPhone is on the same home wifi to add the mp3 to the music app.
> Two years ago it didn't even dawn on me that this would be my way of writing software in the near future

So you were ignorant two years ago; GitHub Copilot was already available to users back then. The only big new thing in the past two years was GPT-4, and nothing suggests anything similar will come in the next two years. There are no big new things on the horizon. We knew for quite a while that GPT-4 was coming, but there isn't anything like that this time.

Copilot was not around when I wrote the Tweet.

But when Copilot came out, I was indeed ignorant! I remember when a friend showed it to me for the first time. I was like "Yeah, it outputs almost correct boilerplate code for you. But thankfully my coding is such that I don't have to write boilerplate". I didn't expect it to be able to write fully functional tools and understand them well enough to actually write pretty nice code!

Regarding "there isn't anything like that this time." : Quite the opposite! We have not figured out where using larger models and throwing more data at them will level off! This could go on for quite a while. With FSD 12, Tesla is already testing self driving with a single large neural net, without any glue code. I am super curious how that will turn out.

The whole thing is just starting.

Well, my point is that you perceive progress to be fast since you went from not understanding what existed to later getting in on it. That doesn't mean progress was that fast, it means that you just discovered a new domain.

Trying to extrapolate actual progress is bad in itself, but trying to extrapolate your perceived progress is even worse.

Yeah, you have hit the nail on the head here. A lot was predictable from seeing that GPT-2 could reasonably stay within language and generate early coherent structures. That, coming at the same time as instructions with the T5 stuff and the widespread use of embeddings from BERT, told us this direction was likely. It's just that for many people this came to awareness in 2021/22 rather than during the 2018-2020 ramp-up the field/hobbyists experienced.
Whisper, Stable Diffusion, VoiceBox, GPT-4 vision, DALL·E 3

Other breakthroughs in graph machine learning https://towardsdatascience.com/graph-ml-in-2023-the-state-of...

Those are image/voice generation, the topic is about potential replacement of knowledge workers such as coders. The discussion about image/voice generation is a very different topic since nobody thinks those are moving towards AGI and nobody argued they were "conscious" etc.