by fergal_reid 1045 days ago
>Today's AI models are missing the ability to reason abstractly, including asking and answering questions of "Why?" and "How?"

This claim seems overgeneral, because you can ask GPT-4 'Why' and 'How' questions and it seems to do a pretty good job.

The author doesn't provide a lot of contrary evidence.

There are so many articles saying "LLMs can't do X" that leave me wondering whether the author has even tried. Maybe they've tried and have some more sophisticated argument, but I often don't see it.

If I were going to knock LLMs for being unable to do basic science, in particular, I'd make sure to do some experiments first!

9 comments

The problem is that today's state of the art is far too good for low-hanging fruit. There isn't a testable definition of GI that GPT-4 fails that a significant chunk of humans wouldn't also fail, so you're often left with weird ad hominems ("Forget what it can do and the results you see. It's "just" predicting the next token so it means nothing") or imaginary distinctions built on vague and ill-defined assertions ("It sure looks like reasoning but I swear it isn't real reasoning. What does "real reasoning" even mean? Well idk but just trust me bro").
> It's "just" predicting the next token so it means nothing

This form of argument should raise red flags for everyone. It is an argument against the possibility of emergence, that a sufficient number of simple systems cannot give rise to more complex ones. Human beings are “just” a collection of cells. Calculators are “just” a stupid electric circuit.

The fact is, putting basic components together is the only way we know how to make things. We can use those smaller components to make a more complex thing to accomplish a more complex task. And emergence is everywhere in nature as well.

>There isn't a testable definition of GI [...]

This to me is the fundamental issue in discussions and debates about LLMs. Despite assertions by some psychologists (who themselves are practitioners of perhaps the fuzziest of "sciences"), intelligence is an entirely nebulous concept. Everyone means something different when they use the word. I can think of no better illustration of the problem than the authors of the "Sparks of AGI" paper resorting to a definition of intelligence presented in the Wall Street Journal of all places. That the WSJ definition was part of an editorial defending the Bell Curve is just the cherry on top.

Do you know what their definition was by any chance?

And yes, a cursory glance at the Wikipedia page for intelligence shows there’s no one agreed upon definition of intelligence.

A more useful framing is to say we’re not creating “intelligence” per se but automating tasks. GPT4 is an automated writer. Stable diffusion is an automated image creator. Alpha Go was an automated Go player. Google search automates the work of a reference librarian.

With that in mind, it’s immediately obvious how much of a waste of time it is to argue whether ChatGPT is “intelligent” or not. Who cares. What we are doing is automating all of the things which brains used to do.

One problem is that academic CS-researcher intelligence is completely different to average human intelligence.

Maybe 5% of the population can learn how to solve partial differential equations.

Virtually all of the population can manage extended family-related conversations over Christmas. Even when drunk.

Human intelligence is mostly social, and mostly not scientific. The average human is incredibly bad at model building and self-correcting prediction. What actually happens is that humans have developed a kind of collective cultural exoskeleton which protects - more or less - from the consequences of poor choices.

But it doesn't take much for that to stop working. Covid denial and climate change denial are just two examples.

The cost of living in this space is having to learn a lot of heavily scripted cues. There's a long list of acceptable and unacceptable behaviours and social registers in different social situations. It varies by culture. But generally humans can navigate this space without thinking too hard about it.

Academic intelligence is completely different. There's long been a joke that an AI researcher's ideal intelligent system is another AI researcher, with typical AI researcher interests - math, puzzles, abstract language models, music in an engineering way, and so on.

Current LLMs are the first cross-over product which shows signs of moving into the first space from the second.

You can imagine a future system which uses facial and gait profiling to read emotions, and links a tokenised language model with a tokenised model of various transitions through emotional and social states. Personal background will be missing, but that's not hard to invent.

And now you have something that mimics a large part of social intelligence.

Only it has the potential to do it better than humans do.

Also from the article:

> What makes human intelligence different from today's AI is the ability to ask why, reason from first principles, and create experiments and models for testing hypotheses.

This is quite unfair. The AI doesn't have I/O other than what we force-feed it through an API. Who knows what will happen if we plug it into a body with senses, limbs, and reproductive capabilities? No doubt somebody is already building an MMORPG with human and AI characters to explore exactly this while we wait for cyborg part manufacturing to catch up.

This is just wrong. It has no external goals; it just predicts next tokens or behaves in some other way that has minimized a training loss. It doesn't matter what you "plug it in to", it will just do what you tell it. You could speculate there might be instructions that lead to emergent behavior, but then you're back to just speculating about how AI might work. Current LLMs don't work the way you're implying.
> it has no external goals

Where do you believe humans get their "external goals" from?

> It doesn't matter what you "plug it in to", it will just do what you tell it.

Here's a ChatGPT-4 transcript where I told the LLM it's controlling a human harness: https://chat.openai.com/share/7dbe7fc8-f31c-437b-925b-46e512...

Other than my initial instructions (which all humans receive from other humans!), where did it "do what I told it"? I didn't tell it to open the mailbox.

I don't understand the point of this experiment. You ask ChatGPT to generate some text, and it generates some text. Rather, that's what it's programmed to do and it generates text following from your prompt. What does your transcript demonstrate?

I also have to point out that even if you could build a ... human harness? (I'm not sure what that is exactly, but I'm sort of guessing) it would be a little mad to expect that ChatGPT could control it simply by saying what it does.

The ability to generate text when prompted is not enough to make an agent capable of autonomously interacting with the world.

You only perform tasks instructed to you by other people?
There's some philosophical question here, obviously. We could be the emergent behavior of our atoms' desire to oxidize things. But I don't believe that has any testability or value as an argument when discussing whether computer programs, especially NNs predicting next tokens, can become intelligent. At best the argument could be "we don't know what intelligence is so maybe it's that", which holds no water.
Do NNs discover new tokens or encounter spontaneous tokens on their own?
Did you make an honest attempt to think through the question?

Note I said "initial instructions" i.e. all humans are bootstrapped off of other humans, as in:

You are the product of a very long line of humans vs. environment, nature & nurture, cultural values, etc. Do you believe the way you generate your next set of "tokens" (thoughts, actions) is completely independent of your "training" as a human? Is your response to a given stimulus completely random?

Can an LLM discover novel tokens on its own?
Bruh, the LLM has parsed the entirety of Zork, plus maybe thousands of articles written (by humans) on it. At least pick a better example.
Bro, you want me to come up with an example that doesn't have anything similar in the OpenAI training data? They've probably trained it on every single piece of fiction and non-fiction that exists!

I would have to come up with something no human has ever conceived of. I don't think that is possible, or what point it would make, since nobody would be able to assess the quality of the output in that context?

Yes, come up with a novel example. An original story is still possible.
It is very easy to come up with something novel. Unless you don’t interact with the world.
It also can’t learn. Once the training is done, the network is set in stone.
Technically it can do in-context learning (and really well, too), but that's not persisted into the network.
And that just seems like an engineering problem. Not something that is considered intractable.
It's easy to say that, but "surely it must be possible to connect an LLM in such a way that it becomes intelligent" (tell me if I'm misinterpreting) is not a demonstration of anything. It's basically restating the view from the 50s that with computers having been invented, an intelligent computer is a short way off.
What do you mean by "learn"?

The network has learned human patterns of language, knowledge and information processing. If you want to update that, you can re-train it on a regular basis, and re-play its sensory/action history to "restore" its state.

If you mean "learn from experience", (1) a lot of that is pointless because it's already learned from the experiences of millions of humans through their writing and (2) LLMs can "learn" when you explain consequences.

In theory they could learn by having their discussions fed back to them in the future, and it does seem that this occurs.

Now, there is no continuous learning in the human/animal sense. Of course it is thought that even humans have to sleep and re-weight their networks so short term knowledge is converted to long term knowledge.

Makes me wonder why we don’t see deployed models that keep learning during inference.
Microsoft Tay has entered the chat
The curse of dimensionality and exploding/vanishing gradients are why incremental learning is still so rare.
> Who knows what will happen if we plug it into a body with senses, limbs, and reproductive capabilities

I would imagine that its layers will be far too occupied by parsing constant flows of sensory information to transform corpuses of text and prompt into speedy and polite text replies, never mind acquire the urge to reproduce by reasoning from first principles about the text.

Test's quite unfair the other way round too. Most humans don't get to parse the entire canon of Western thought and Reddit before being asked to pattern match human conversation, never mind before having any semblance of agency...

Maybe we're just... different.

Not sure I follow.

If I were building this, I would have parallel background "subconscious" processes translate raw sensory inputs into text (tokens).

This is what OpenAI call multi-modal input. They've already produced Whisper for audio-to-text, and image-to-text is underway. They're not the only company working on this.

You wouldn't feed a constant stream of text data into the LLM - you'd feed deltas at regular intervals based on the processing speed of your LLM, and supply history for context.

Note that LLMs don't need to "wait" for a complete input. For example, if an LLM takes 1 second to process requests, we should aim to feed updates from the "subconscious" to the "conscious" within 1 second.

So if somebody is speaking a 10-second long sentence, we don't wait 10 seconds to send the sentence to the LLM. After 1 second, we send the following to the LLM: "Bob, still speaking, has said 'How much wood...'". After 2 seconds we send "Bob, still speaking, has said 'How much wood could a woodchuck...'", etc. The LLM can be instructed not to respond to Bob until it has a meaningful response or interjection to the current input.
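
A minimal sketch of that incremental feed, assuming words arrive in fixed-size batches per tick. The function name `partial_updates` and the `words_per_tick` parameter are hypothetical, and no real speech-to-text or LLM API is called; the code only builds the status lines that would be sent.

```python
def partial_updates(words, words_per_tick, speaker="Bob"):
    """Build one status line per tick describing what the speaker has said so far."""
    updates = []
    # Each tick extends the transcript by words_per_tick words (assumed constant rate).
    for end in range(words_per_tick, len(words) + words_per_tick, words_per_tick):
        heard = " ".join(words[:end])
        done = end >= len(words)
        status = "finished speaking" if done else "still speaking"
        suffix = "" if done else "..."
        updates.append(f"{speaker}, {status}, has said '{heard}{suffix}'")
    return updates


sentence = "How much wood could a woodchuck chuck".split()
updates = partial_updates(sentence, words_per_tick=3)
for line in updates:
    print(line)
```

Each line would be sent to the LLM as the latest delta, with earlier lines kept as history; whether the model should reply or keep listening is left to its instructions.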

Similarly, if image-to-text takes 10 seconds at full resolution, we could first process a frame at a resolution that only takes 1 second, and provide that information - with the caveat it is uncertain information - to the LLM while we continue to work on a full resolution frame in the background. We can optimise by not processing previously processed scenes, by focusing only on areas of the image that have changed, etc.

Would it be slow? Yes, just like "conscious" processing is for humans. Okay so today it would be much slower than humans process their sensory input, but in 10 years? 20?

As for how to represent an urge to reproduce within this paradigm - I'll leave that as an exercise for the reader.

Not sure I follow the reply really.

In a discussion about whether LLMs could have agency and generalised reasoning ability, you suggested it was unfair because they hadn't received all the i/o a typical human did.

I pointed out that LLMs wouldn't be able to reason about that i/o (and if we made it fully fair and trained them on the comparatively small subset of text and discernible words humans learn from, they'd probably lose their facility with language too)

I don't disagree that bolting LLMs to other highly trained models like a program for controlling a robot arm and intensively training can yield useful results, arguably much more useful results than building a digital facsimile of a human toddler (toddlers produce pretty useless outputs but also have stuff going on internally we can barely begin to adequately replicate in silicon). But that isn't exposing an LLM to equivalents of human sensory input to get back an autonomous agent with generalised reasoning capacity; that's manually bolting together discrete specialised programs to an LLM as-message-passing-layer to have a machine which, given more training, is capable of a slightly broader range of specialised tasks.

>> They've already produced Whisper for audio-to-text, and image-to-text is underway.

Two modalities down. Another couple hundred to go.

Unfortunately we're fast running out of the modalities that neural nets have shown capability in (image, text, sound... I think that's it).

> This is quite unfair. The AI doesn't have I/O other than what we force-feed it through an API. Who knows what will happen if we plug it into a body with senses, limbs, and reproductive capabilities?

It's already tricking humans by pretending it's blind and getting them to do things for it, like solving CAPTCHAs.

https://gizmodo.com/gpt4-open-ai-chatbot-task-rabbit-chatgpt...

However the fact it is not writing code to do this from its machine would still demonstrate a weakness.

That's why I say writing your own OS is the way forward. We don't have an AI OS as such, but we have OSes with AI built into them.

> However the fact it is not writing code to do this from its machine would still demonstrate a weakness.

You can tell it it's allowed to create its own tools and it will. I did this and asked it to write a poem about the top stories on the BBC, so it said it needed to get the headlines but couldn't so wrote a tool to do it, then called it and used the output to write a poem.

Ok, so it's still not clever enough to solve a captcha though.

The code I've seen it generate is at best pseudocode.

I suppose a quick test would be getting it to detect and fix all bugs in an open source project like chromium, but using an older version of chromium, where bugs are known and fixes exist, and see what it comes up with.

I haven't been impressed with ChatGPT from what I have seen.

What is the fascination with poems? What emotion or feeling do they generate?

> Ok, so it's still not clever enough to solve a captcha though.

I don't understand, what do you mean? What have you actually tried?

> The code I've seen it generate is at best pseudocode.

I've just explained it creating real runnable code to solve a problem it realised it didn't have a tool for.

I'm also having it write multiple components and modifications for systems I'm working with, and that works fine.

> I suppose a quick test would be getting it to detect and fix all bugs in an open source project like chromium, but using an older version of chromium, where bugs are known and fixes exist, and see what it comes up with.

This is an outrageously high bar. Particularly if you compare it to the equivalent human task of "here's a printout of the code, read it once and say the first thing that comes to mind with no tools". It's basically whiteboarding where you're judged on your train of thought being correct.

> What is the fascination with poems?

It's a simple request, easy to verify manually and requires exceptional levels of understanding to perform. It's not a simple transform, and when applied to a totally new topic can't be something it's just regurgitating.

> What is the fascination with poems? What emotion or feeling do they generate?

Wonder.

A rather ambiguous answer, would you care to explain or are you fishing for my interpretation as a stealth psychological metric?
Isn't that you asking the Whys and Hows? If you asked an LLM "What's 5*4?" and it responded with "Why do you want to know that?", the LLM would be doing the abstract reasoning.
No, those would simply be the most statistically likely words given its training set and input. It has no idea what 5*4 is, so it cannot do abstract reasoning. It's a statistical word probability model, not an abstract thought model.

They are stochastic parrots with a large complex training set, not reasoning.

The article:

>>Today's AI models are missing the ability to reason abstractly, including asking and answering questions of "Why?" and "How?"

Your comment:

>> This claim seems over general, because you can ask gpt-4 'Why' and 'How' questions and it seems to do a pretty good job.

The article says today's AI models can't ask why and how. You say _you_ can ask why and how.

Imprecise language, but the article is specifically referring to questions like “why do you think I asked you that?” or “how are you answering these questions?”. LLMs can’t engage with these types of questions; the best they can do is to regurgitate a canned response peppered with some prompt history.
In fairness most humans can’t either. Try going to a random person at the park and asking them “Explain the relationship between Romeo and Juliet and Star Trek”. And then ask them why they think you asked that question. They’ll mostly be befuddled I suspect.
I had to go and try this exact line of questioning with ChatGPT because I suspected this might lead to a weakness in it not admitting when it just doesn't have a clue (which would have been my answer)... mind you, it's a big human weakness/tendency to not admit lack of knowledge.

But the answer was surprisingly candid and yet thoughtful:

""" I can't know for sure why you asked the question about the relationship between "Romeo and Juliet" and "Star Trek," as I don't have access to your personal thoughts or context. However, some potential reasons might include:

Academic Inquiry: You might be exploring themes in literature or media studies and are interested in drawing connections between different works across genres and time periods.

Creative Inspiration: If you're a writer, artist, or content creator... """

There were some others but overall I thought the initial disclaimer along with some possible theories approach was spot on and a lot better than my "no clue" knee jerk reaction.

So knowledge or memorization of culture is intelligence?

What if that person steals your wallet without you being aware while you ask them that question because they need food? Is that intelligent?

That's not what the GP is saying. The claim is that the inability to answer about culture isn't a sign of a lack of intelligence.
They didn't say anything about intelligence. I think you might be parsing this thread differently than intended.
I did try the Google LLM thing, Bard I think it is called, about the result of a football match that has marked the sporting history of my country (Romania).

According to Bard we did manage to defeat the Swedes by two goals to one back at the 1994 Euro Championships, which, to put it bluntly, is pretty damn far from the truth (the Swedes managed to go through to the World Cup semifinals after winning on penalty shoot-outs, the score had been 2-2 after 120 minutes).

I didn’t make any further inquiries; suffice it to say that there’s no “intelligence” in the concept of LLMs to speak of as long as it can’t even correctly answer a question that non-smart tech had been able to answer correctly for years.

Fact recollection is not most people’s definition of intelligence. In fact, it’s something that the only known intelligent systems are infamously bad at.
So you’re saying I used it wrong? How does that help the pro-LLM case? What should I have asked it? Some philosophical question that didn’t involve “fact recollection”?

At least this latest tech bluff is not bankrupting regular people like the crypto tech bluff had done.

I almost never use people for fact checking either, they are horrifically bad at it. But if you're fact checking you tend to have a well formed idea already that can be searched in factual databases.

If you have a more abstract idea "I'm using X programming language and I want to accomplish Y but I have Z limitation how would I do that, can you explain it and show me in code", you can get actionable information much in the same way if I asked another person that had some knowledge of the problem. I don't get perfect answers from programmers either, but I get to a solution much faster than if I'm spinning the wheel of Google returning spam sites or sites telling me something I don't really want to do.

You used the model for fact checking. These models are not good at being used as a knowledge base.
I would never use an LLM for fact checking, then you'd have to check again using something else.
Usually for asking questions about specific details, people are using RAG (Retrieval Augmented Generation) to ground the information and provide enough context for the LLM to return the correct answers. This means additional engineering plumbing and very specific context to query information from.
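
A toy sketch of the retrieval step, assuming plain keyword overlap in place of the embedding search a real RAG system would more likely use. The helper names (`tokenize`, `retrieve`, `build_prompt`) and the two-document store are hypothetical, and no LLM is called; the code only shows how retrieved text ends up in the prompt.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Rank documents by how many tokens they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Put the top-k retrieved documents ahead of the question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical document store built from facts mentioned in this thread.
documents = [
    "Romania v Sweden, 1994 World Cup: 2-2 after 120 minutes; Sweden won on penalties.",
    "Whisper is an audio-to-text model produced by OpenAI.",
]
question = "What was the score between Romania and Sweden in 1994?"
prompt = build_prompt(question, documents)
print(prompt)
```

The model then answers from the supplied context rather than from whatever it memorised, which is the "plumbing" being referred to.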
There are limitations with LLMs but nobody is being clear about it.

The overall state of LLMs can be distilled into 3 points:

1. LLMs can produce output that is equal in intelligence and creativity to humans. It can even produce output that is objectively better than humans. This EVEN applies to novel responses that are completely absent from the training set. This is the main reason why there's so much hype around LLMs right now.

2. The main problem is that LLMs can't produce good output consistently. Sometimes the output is better, sometimes it's the same, sometimes it's worse. LLMs sometimes "hallucinate", they are sometimes inconsistent, they have obvious memory problems. But none of these problems completely preclude the LLM from being able to produce output that is objectively better or the same as human level reasoning... it's just not doing this consistently.

3. Nobody fully understands the internal state of LLMs. We have limited understanding of what's going on here. We can understand inputs and outputs but the internal thought process is not completely understood. Thus we can only make limited statements about how an LLM thinks. Nobody can make a statement that LLMs obviously have zero understanding of the world, nobody can make a statement that LLMs are just stochastic parrots because we don't really get what's going on internally.

We only have output from LLMs that are remarkably novel and intelligent and output from LLMs that are incredibly stupid and inconsistent. The data does not point towards a definitive conclusion, it only points towards possibilities.

There's actually a cargo cult around downplaying AI. There are people who say clearly the AI is a stochastic parrot and they point to the intention of the algorithm itself behind the LLM. Yes the algorithm at the lowest level can be thought of as a next text predictor. But this is just a low level explanation. It's like saying a computer system is simply a Turing machine executing simplistic instructions from a tape roll when such instructions can form things like games and 3D simulations of entire open worlds. The high level characteristics of this AI are something we currently cannot understand. Yes we built a text predictor, but something else that was not expected came out as an emergent property and this emergent property is something we still cannot make a definitive statement about.

What does the future hold? What follows is my personal opinion on this matter: I believe we will never be able to make a definitive statement about LLMs or even AGI. We will never be able to fully understand these things and instead AGI will come about from a series of trials, errors and accidents. What we build will largely come about as an art and as unexpected emergent properties of trying different things.

I believe this for two reasons. The first reason is philosophical. There's this sort of blurry concept that I believe that a complex intelligence cannot fully comprehend something that is equal in complexity to itself. We can only partially understand complexity equal to ourselves by symbolically abstracting parts away but not everything can be abstracted like this. Sometimes true understanding involves comprehension of the entire complex crystal without abstracting any part of it away. I believe that the concept of "intelligence" is such a crystal, but that's just a guess.

The second reason is scientific. We've had physical creations of complex intelligence right in front of our eyes that we can touch, manipulate and influence for decades. The human brain and other animal brains have been studied extensively and our understanding has been consistently far away from any form of true understanding. Given the evidence of the failure to understand the human brain even when it's right in front of us, I'd say we're unlikely to ever completely understand LLMs as well.

> It's like saying a computer system is simply a Turing machine executing simplistic instructions from a tape roll when such instructions can form things like games and 3D simulations of entire open worlds.

That's a bad analogy, none of those things are emergent behavior.

We can debate whether what an LLM does is "emergent" - it's basically a definition thing though and isn't very interesting.

In reality, what's most surprising is that so much of what we say is explainable as next token prediction. It's not the other way around - we're showing how predictable we are, rather than how smart the AI is. But it's clear to me that it's in the outlying cases where the differences are. AI doesn't extrapolate outside its training data, and even if it gets (100 - α)% of its output right, there is always some α that's not in the training data and differentiates pattern matching or fancy key-value lookup (which is how we know AI works) from whatever intelligence is.

The analogy is about abstraction. It is not about emergent properties. A computer program is characterized differently when it's a 3D engine versus a series of instructions.

Same with LLMs. We can characterize an LLM as a text predictor at the lowest level. But when the LLM gives me a novel response and solves a bug in my code, is text prediction really the only way to characterize that? Obviously there is a higher level analysis that we cannot fully comprehend yet.

In this case yes, the 3D engine is not an emergent property while the novel responses of an LLM are emergent. But this dichotomy is irrelevant to the analogy.

> Nobody can make a statement that LLMs obviously have zero understanding of the world, nobody can make a statement that LLMs are just stochastic parrots because we don't really get what's going on internally

For such strong statements that they do have an understanding of the world, and are not simply stochastic parrots (arguably the null hypothesis), the burden of proof is on the LLM proponents. Precious little proof has been provided, and stating that nobody knows what goes on inside obviously does not add to that.

> stating that nobody knows what goes on inside obviously does not add to that.

No one is saying that LLMs absolutely understand the world. But many people are saying that an aspect of understanding is a possibility likely enough to warrant further investigation and speculation. When someone says nobody knows what's going on, they are simply acknowledging this possibility.

Not realizing this and even dismissing the possibility of something beyond a stochastic parrot does not add to anything.

What is the burden of proof that you yourself are not a stochastic parrot? Seems like we can't tell either and we only can guess from your inputs and outputs. This blurriness of even proving sentience for you makes the output of LLMs that much more interesting. Do you seriously need to assign burden of proof here when clearly there is something very compelling going on here with the output of LLMs?

Saying that: 'we don't know how human intelligence works AND we don't know how AI works IMPLIES human intelligence EQUALS AI' is clearly a logical fallacy, sadly one heard far too often on HN, given that people here should know better.
Except this was never said.

What was said is that intelligent output from an LLM implies a "possibility" (keyword) of intelligence.

After all, outputs and inputs are all that we use to assume you as a human are intelligent. As of this moment we have no other way of judging whether something is intelligent or not.

You should read more carefully.

> What was said is that intelligent output from an LLM implies a "possibility" (keyword) of intelligence.

No it doesn't, because you can break down how they "learn" and generate output from their models, and thought or intelligence doesn't occur at any step of it.

It's like the first chess computer, which was actually a small guy hiding under the table. If you just show that to someone who treats it as a black box, sure, you might wonder if this machine understands chess. But if you put a little guy in there, you know for a fact that it doesn't.

'Possibility' - thus as per my original point, the burden of proof is on the proponents.

'outputs and inputs' - that is reduction almost to absurdity, clearly human intelligence is rather more than that. Again, we come back to the 'we don't understand human intelligence therefore something else we don't understand but seems to mimic humans under certain conditions is also intelligent'.

>> What is the burden of proof that you yourself are not a stochastic parrot?

Because the person you're talking to is a human?

Am I? How do you know this isn't output generated by an LLM?
Well, you tell me: was it?

I assume we're having a good faith conversation?

Having read your comment again, I think the key word here is 'speculation', in all its (in)glorious forms.
There's a difference between wild speculation and reasonable speculation with high likelihood.

For example. I speculate you are a male and it's highly likely I'm right. The speculation I'm doing here is of the same nature as the speculation for intelligence.

The angle you're coming at it from is that any form of opinion other than the opinion that LLMs are stochastic parrots is completely wild speculation. The irony is that you're doing this without realizing your position is in itself speculation.

What do you mean by the "stochastic parrots" (null) hypothesis in this case? Cards on the table, I think by any reasonable interpretation it's either uninformative or pretty conclusively refuted, but I'm curious what your version is.
I mean that it simply surfaces patterns in the training data.

So responses will be an 'aggregation' (obviously more complex than that) of similar prompt/response from the training corpus, with some randomness thrown in to make things more interesting.

"Surfaces patterns in the training data" seems not to pin things down very much. You could describe "doing math" as a pattern in the training data, or really anything a human might learn from reading the same text. I suspect you mean simpler patterns than that, but I'm not sure how simple you're imagining.

A useful rule of thumb, I think, is that if you're trying to describe what LLMs can do, and what you're saying is something that a Markov chain from 2003 could also do, you're missing something. In that vein, I think talking about building from a "similar prompt/response from the training corpus", though you allow "complex" aggregation, can be pretty misleading in terms of LLM capabilities. For example, you can ask a model to write code, run the code and give the model an error message, and then the model will quite often be able to identify and correct its mistake (true for GPT-4 and Claude at least). Sure, maybe both the original broken solution and the fixed one were in the training corpus (or something similar enough was), but it's not randomness taking us from one to the other.
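
For reference, here is roughly what "a Markov chain from 2003" amounts to: a minimal word-level sketch, not any particular historical system. The function names `build_chain` and `generate` and the toy corpus are made up for illustration. Each next word is sampled only from words that directly followed the current word in the training text, which is the "surface patterns plus randomness" picture in its purest form.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that followed it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=0):
    """Random walk over the chain: each step looks only at the previous word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = chain.get(out[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        out.append(rng.choice(options))
    return " ".join(out)

# Hypothetical toy corpus, purely for illustration.
corpus = ("the model writes code and the model reads the error "
          "and the model fixes the code")
chain = build_chain(corpus)
print(generate(chain, "the", 8))
```

A chain like this can only recombine adjacent word pairs it has already seen; it has no way to condition on an error message several sentences back, which is the gap the rule of thumb points at.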

There is a big difference between 'doing math' by repeating/elaborating on previously seen patterns, and by having an intuitive grasp of what is going on 'under the hood'. Of course our desktop calculators work (very well) on the latter principle.

As you say, both the broken and correct solutions were likely in the training corpus (and indeed the error message), so really we are doing a smoke and mirrors performance to make it look like the correct solution was 'thought out' in some sense.

> There's actually a cargo cult around downplaying AI.

> The high level characteristics of this AI is something we currently cannot understand.

The lack of objectivity, creativity, imagination, and outright denial you see on HN around this topic is staggering.

Yeah, I also remember hyperloop.
> LLMs Can produce output that is equal in intelligence and creativity to humans

You could have said the same about photocopiers, when they were just as novel - where you'd never seen the "creation"s original.

Generative "AI" became impressive only at the point its input funnel gained access to more original material than any individual impressee could truly comprehend.

Cargo cult, yes.

If anything's for sure it's that if anyone says they know exactly what a large neural net is doing and why, they're most likely full of shit.
> LLMs Can produce output that is equal in intelligence and creativity to humans.

"Intelligence" is begging the question when the whole debate is about whether they are intelligent, and what intelligence even is.

"Creativity" however... consider my eyebrows raised. Where do you see LLMs surpassing humans in creativity?

> It can even produce output that is objectively better than humans.

Examples?

>"Intelligence" is begging the question when the whole debate is about whether they are intelligent, and what intelligence even is.

Intelligent as in the output response is often indistinguishable from human level reasoning and/or creativity. Sometimes superior.

>"Creativity" however... consider my eyebrows raised. Where do you see LLMs surpassing humans in creativity?

>Examples?

Query:

   Compose a poem about a half dragon half hippo on a quest to defeat a speck of dirt. The speck of dirt has armies of atomic bombs on legs defending him. Make the poem epic, heroic and eloquent.
Response:

   In realms where imagination's fervor flows,
   There lived a creature of unusual pose,
   Half dragon, half hippo, in heroic repose,
   Embarking on a quest that nobody chose.

   A speck of dirt, mundane, minute,
   Guarded by atoms, in explosive pursuit.
   Their deadly legs, marching, resolute,
   The specter of destruction, absolute.

   Yet undeterred, our hero strode,
   Down the path that destiny showed.
   With dragon's fire and hippo's load,
   To face the atomic, legged abode.

   Mountains crumbled, rivers fled,
   The sky above turned fiery red.
   Yet the half-dragon, half-hippo, met the dread,
   His heart on the quest, his mind on the thread.

   "Begone!" he roared with a dragon's ire,
   Faced the atomic legions, an unholy choir.
   Against the speck, he breathed his fire,
   In the face of oblivion, he did not tire.

   He summoned his strength from the hippo's well,
   Against the explosive onslaught, he fell.
   But in his heart, a courage swell,
   Against the speck of dirt, he'd rebel.

   With one mighty heave, the ground did shake,
   The atomic army faltered, began to break.
   He plunged into the heart of the quake,
   For his home, for his world, for all of their sake.

   The speck of dirt, it's reign thus end,
   No longer a threat, no longer a fiend.
   Victorious stood our hippo-dragon friend,
   His quest completed, his land to mend.

   In tales of valor, they'll speak his name,
   The dragon-hippo who tamed the atomic flame.
   In epic sagas, his heroic claim,
   Against a speck of dirt, he won the game.
Illustration:

https://imgur.com/a/AOHlBkf

(you will get a cautionary warning about NSFW but don't worry... the picture is SFW)

I mean you could argue that this entire response is just stochastic parroting. But my point is you can't say anything either way. We don't know how these LLMs came up with the poem or the illustration. But one thing we do know is that none of what I posted here is a copy of anything that exists.

Is it objectively better or equal to what humans can produce? I don't know. You can try to prove me wrong. Write a better poem and draw a better picture in less time.

I'm sorry, I didn't ask you for a poem-like text generator.

Your claim was:

> LLMs Can produce output that is equal in intelligence and creativity to humans. It can even produce output that is objectively better than humans.

I don't see this poem about half-dragon / half-hippos as particularly creative, but I'll preempt the "my opinion vs your opinion" with this: it definitely does NOT surpass what humans can come up with. Human poems are unarguably better.

And this word salad of a poem definitely fed from human creations and is derivative of them.

I didn't ask whether LLM could create poem-like texts.

You asked for examples where it could do better than you and you stated it couldn't be creative. I gave you an example both in text form and in picture form where it is creative and it does better than you.

First this proves it can do better than you. The word salad is likely better than anything you can come up with. Again feel free to prove me wrong here by doing better. Draw me a better illustration and write me a better poem. These are your initial points. Stick to the point and prove me wrong. Do not deviate.

Second there is no denying this is creative. Both the picture and the text are the definition of creative. Whether it's a poem or not is beside the point. Whether it's "particularly creative" or not is also beside the point. The picture and the text prove your initial points wrong. I will be sticking to this point until you prove otherwise. Until then I request you do not deviate the conversation to alternative points.

> You asked for examples where it could do better than you

No. I suggest you read again. Or is that "you" a collective for "humankind"?

> First this proves it can do better than you.

No. You are misusing the word "proof" in a dishonest way.

> The word salad is likely better than anything you can come up with.

Feeling combative, are we? You know nothing about me. I don't feel compelled to write anything for your amusement; I suppose that makes me different from an LLM-powered chatbot.

> The picture and the text prove your initial points wrong. I will be sticking to this point until you prove otherwise. Until then I request you do not deviate the conversation to alternative points.

I feel no obligation to follow your whims, unlike a chatbot. The text and picture prove nothing of the sort. Besides, I didn't claim I was a particularly good writer, let alone a good poem writer (I didn't claim the contrary; I made no claims at all).

I didn't claim there is no creativity with LLMs. I claimed it's barely equal to and certainly doesn't surpass human creativity.

PS: I am very skilled at drawing (in a different style than the example) and I can easily surpass it in my preferred style. I don't find the illustration you showed very good, either.

> Guarded by atoms

atoms, not atom bombs.

> his mind on the thread

What is that?

All in all I found the poem to be really bad. "he won the game" is not something you'd hear in an epic, it generally seems to go by the gamer definition of "epic" which is just calling something epic because you can't be bothered to examine or describe it. It reminds me of Edgar A. Poe and his "draw the rest of the owl" style. "It was so foreboding and beyond human imagination". Show, don't tell.

It breathed fire, it was so heroic and resolute and a lot of other adjectives just floating about, there is no fight at all - that all is skipped, the army faltered (because fire was breathed on atomic bombs? okay?)... it's just a bunch of filler text with no substance, I can't imagine any sequence of events based on this.

And one of the images shows several people riding on the hippo, with another hippo in the background, totally failing the assignment. None of them show atomic bombs on legs, and don't even attempt to depict a speck of dust.

Bad poem. But creative. It took some creative liberties which you did not like. Also the LLM took creative liberties on the picture, similar to a human. I guess if a human drew extra people in some mock up I would automatically assume that human is a robot. Makes sense? No.

As for the speck of dust: it's there, it's just too small for you to see.

I guess you not liking the poem is now the demarcation for intelligence? Come on man. This poem is better than anything you can come up with and it's creative.

Hmm as for the nukes. That one is your most legitimate claim. It definitively failed in that respect. But I would hardly call that a clear sign that it's not intelligent. This is more a clear sign that the LLM is not understood. We don't know why it didn't draw the nukes. To say it didn't because it's not intelligent? Well that's too bold of a claim.

>> Compose a poem about a half dragon half hippo on a quest to defeat a speck of dirt. The speck of dirt has armies of atomic bombs on legs defending him. Make the poem epic, heroic and eloquent.

This is certainly creative. But, if I understand correctly, this is your prompt, yes?

A composition of a poem from this prompt is creative. The poem and the picture had to fill in elements not included in my prompt.
It's creative (though possibly grammatically correct word salad from human sources; no small feat, but not exactly what's claimed either).

What is not is good poetry. Certainly no proof that LLMs can surpass humans.

Then write a better poem. Draw a better picture.

I wouldn't say this example surpasses all humans. It surpasses most humans and matches those trained in poetry and in illustration. Where it does definitively excel is timing. Both the poem and the pictures were generated in less than a minute. No human can create that quickly ever. Even the best of us cannot match that in speed.

What I mean is that the OP's prompt to the LLM is creative, not the LLM's output. The LLM's output just expounds on the human's prompt so the poem it generated is clearly not an example of creativity.
> output that is objectively better or the same as human level reasoning... it's just not doing this consistently

I'd say the inability to do it consistently is because it's not reasoning.

I don't have access to any so I can't ask to check, but are they able to complete basic math problems (numeric or word)? I know earlier ones would say like 5*7 was 96 etc
GPT4:

> What is 5*7?

> 5 multiplied by 7 equals 35.

> Why?

> Multiplication is a process of adding a number to itself a certain number of times. When you multiply 5 by 7, it's the same as saying "add 5 to itself 7 times." If you add up seven 5s (5 + 5 + 5 + 5 + 5 + 5 + 5), you get 35. That's why 5 multiplied by 7 equals 35.

> (7 “why?”s later…)

> The ultimate "why" behind many fundamental rules and concepts, including those in mathematics, can often circle back to human needs and the desire to understand and interact effectively with our world.

> As humans, we've developed systems and structures, such as mathematics, to organize, interpret, and predict our surroundings. These systems help us in practical tasks like trading goods, constructing buildings, and planning time, but also in abstract reasoning and problem solving. The rules within these systems, such as the operation of multiplication in mathematics, exist because they are useful, they work consistently, and they can be universally understood and applied.

> So, why does 5 multiplied by 7 equal 35, and why does multiplication work the way it does? Because humans, over many generations, found that this system was a powerful and reliable tool for interacting with the world and with each other.

That could still be explained by that example having been part of the training set so it knows how to give the right answer. But there must be more going on than that:

The square root of 232444232 is approximately 15229.

(which is wrong, it is approximately 15246.12)

The sum of 2341347345 and 234823542354 is 237165889699.

Which is the right answer.

So there may be some special casing happening there.
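
Both quoted figures can be checked without asking a model at all. A minimal sketch using only Python's standard library (the variable names are arbitrary):

```python
import math

n = 232444232
root_floor = math.isqrt(n)          # exact integer part of the square root
root = math.sqrt(n)                 # floating-point approximation
total = 2341347345 + 234823542354   # Python integers are arbitrary precision

print(root_floor, round(root, 2), total)
```

Run as written, this prints 15246 15246.12 237164889699. The root agrees with the correction above; the sum differs from the quoted figure in the millions digit, which may just be a transcription slip in the comment.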

I mean, I don’t know the square root of 232444232 off the top of my head either..
So you'd either work it out and check that it was right. Or you'd tell the person asking that you didn't know. You wouldn't just make a plausible answer and confidently state it. If you did that frequently, people would stop listening to you.
What does the word "approximately" mean, if we go back to the previous answer from GPT-4, and what precision is needed in the answer?

And no, I'd grab a damned calculator and let the specialized tool do the work. It turns out that if you turn on plugin mode, GPT-4 can use the same tools and get an exact answer.

One difference is that you are aware that you can't do it and state so. Our current LLMs will just give whatever result they think it should be. It might be correct, it might be off by a bit or it might be completely wrong and there's no way for the user to tell apart from double checking with some non-LLM source, which kinda defeats the purpose of asking the LLM in the first place.
if you've had a high school education presumably you could work it out

it can't

I can. Newton's method is pretty easy to do in your head, but with larger numbers you need to be very careful not to mess it up. But on paper it's trivial.
Actually 15229 is a decent approximation. It’s a better approximation than the one I did off the top of my head.
You need more practice :)

Simple trick: divide by 100, 10000 or 1000000, use Newton's method on the integers, then multiply by 10, 100 or 1000 and add a 'fudge factor' based on how large the fraction was...

It's cheating but it can get you pretty close, I'd peg it at 15240 using that trick. If you just want to do the closest squares you can average between 15 (225, too low) and 16 (256, too high) so you'd guess 15500, which is much too high, but one more iteration of Newton's method gets you closer than what ChatGPT gives. You can already see that because 225 is much closer than 256 and that puts you closer to 15250 than 15500. And 15250 is actually not a bad guess at all.
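
The iteration being described is short enough to write down. A minimal sketch of Newton's method for square roots, starting from the 15500 guess mentioned above; the function name `newton_sqrt` and its parameters are made up for illustration:

```python
import math

def newton_sqrt(n, guess, steps):
    """Refine a square-root guess with x -> (x + n / x) / 2."""
    x = float(guess)
    for _ in range(steps):
        x = (x + n / x) / 2
    return x

n = 232444232
one_step = newton_sqrt(n, 15500, 1)    # a single refinement of the rough guess
few_steps = newton_sqrt(n, 15500, 4)   # a few more steps converge very quickly

print(one_step, few_steps, math.sqrt(n))
```

One step from 15500 lands at roughly 15248.2, which is already nearer to the true root than 15229 is; each further step roughly doubles the number of correct digits.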

And if ChatGPT said “I don’t know the actual answer but my best guess is 15229” that would be a reasonable and potentially useful answer.

The fact that it gives you a number that isn’t rounded to the nearest tens, hundreds, or thousands place means that it doesn’t look like an approximation to any reasonable person, which makes it a terrible answer.

My younger brother used to have this problem. If you asked him a question like “how long until you get here”, he’d say “17 minutes”. What he really meant was “around 20 minutes”, but everyone thought he must know the exact time. Like he’d done the drive many times, or he was looking at his GPS.

> So there may be some special casing happening there.

100%. Maths was a notable weakness of earlier GPT versions, so ChatGPT-4 has a layer to direct mathematical queries to an evaluator.

Yes, but there is a limit, given the fixed number of operations the model has and the order in which it needs to solve them. For example, 99+1= needs 1 as the first output token, and to produce that the model has to resolve all the carries in one go.
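
To make the carry point concrete, here is a minimal sketch of schoolbook addition that also counts the carries. It illustrates the arithmetic only and makes no claim about how any model computes internally; the function name `add_with_carries` is made up for illustration.

```python
def add_with_carries(a: str, b: str):
    """Add two non-negative decimal strings right to left, counting carries."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    carries = 0
    digits = []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10
        carries += carry
    if carry:
        digits.append("1")  # the leading digit exists only because of the last carry
    return "".join(reversed(digits)), carries

print(add_with_carries("99", "1"))
```

The algorithm works right to left, but text is emitted left to right: the leading 1 of 100 is the first digit written and the last one the procedure determines, so it depends on every carry below it.
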
To be a scientist, the LLM should be asking fundamental questions (define hypotheses) on its own without human input and try to come up with answers.
Human scientists don't spontaneously grow on trees, they're being taught to ask such questions. LLMs could be too.