by outlace 1190 days ago
ChatGPT and its relatives are very impressive at first, but I've been using ChatGPT-3 and now 4 heavily every day since they became available to individuals, and once you use them this much it becomes very clear how NOT intelligent they are. After this much use, and after finding so many failure modes, it really just seems like extremely impressive statistical inference. Still, it is impressive how much of human intellectual endeavor can be captured by sophisticated statistical inference. Very useful technology nonetheless.

Simple thought experiment: if I were a high school teacher and graded hundreds of exams, papers, and homework assignments every month for years, would I develop the same opinion of my C-level students? The ones whose work is somewhat correct but often inconsistent, where it's clear that they're not thinking about the material, etc.

Humans all start out like this, but as they grow they develop different masteries. Many of those masteries are easy to learn and shared across all humans, like how to walk or tie your shoes. Many are domain-specific, like how to perform heart surgery or solve complex math problems.

I don't think AGI is a well-defined term, based on the way it's thrown around, but sure, I don't think it's "AGI". I do know it's going to be very useful technology, especially multimodal GPT-4 and beyond. I think our interface with AI is going to be multimodal.

Suddenly Google Glass could be an interesting product again. Glasses would stream audio/video to your phone, you'd be able to ask questions about things you're seeing, and it would display the answers on the screen or dictate them back into your earbuds. I like the idea of having a travel companion, another brain at my fingertips. Sort of like what the mobile phone did for us, but a lot less work for the user. I'm getting impatient waiting for this GPT-4 API access!

So is Apple negotiating with some LLM maker to integrate with realityOS?
Given the endless cycle of "leaked" product details followed by "leaked" delays, such that it's been 6-18 months away for the last 8 years or so, I now assume it only exists as a fake product that Apple uses to find leakers.
This has been my take as well. It's without a doubt at the point where it's useful and I'm very interested to experiment with it and see how much utility I can get out of it. At the same time, as much as I realize skill with prompting is essential, a lot of my time is spent programming automations around using those prompts. A lot of thought and care goes into it, and the failure rates of what I'm building are dramatically higher than anything I build without AI.

It's like I'm convincing AI to do a "smart" thing (which is really cool, don't get me wrong), but the AI truly has no idea what it's supposed to be doing. Sometimes even very direct, specific, concise prompting will yield totally abysmal responses.

I'm certainly not putting it down either though. It's unprecedented and it marks a genuinely important shift in the industry that this is possible at all. It just isn't all that smart when it comes to intents and actually knowing what's going on.

That is pretty much the case, but I'm always taken aback by how many people believe intelligence = directly replicating human thought. I thought the common consensus was that (artificial) intelligence was instead about mimicking enough of the process to provide the outcome. Largely because it's impossible to replicate something when we don't entirely know how it works. Major discoveries of basic aspects of the mind are still happening regularly.
> Largely because it's impossible to replicate something when we don't entirely know how it works.

On the contrary. These are mostly orthogonal.

How do you believe they're orthogonal?
Biological reproduction for one. Copy/paste for another. Biological reproduction is only tenuously related to understanding and copy/paste isn’t even related at all. We can copy around weights and biases all day without understanding them.
Makes sense, you've changed my mind a bit but only the part that consciously understands.
> how NOT intelligent they are

In what sense? How do you define intelligence and what is missing?

> finding so many failure modes.

Failures do not indicate a lack of intelligence. People make errors all the time.

We should differentiate failures which arise from poor training from failures which arise from a lack of general capabilities.

There are people who learned a bad way to do something and struggle to change that. Should we consider them unintelligent?

Here's an example of it failing to understand some very simple Python code that any human with basic programming knowledge would not mess up:

------

Me: What is going to be the output of this code when `x = 0` and when `x = 900`:

```python
x = 0

for i in range(100):
    if x > 50:
        x += 1
    else:
        x += 5

    if x == 90:
        print("X is not 90")
    else:
        pass
```

GPT-4: When x = 0, the code will execute as follows:

The initial value of x is 0. The for loop iterates 100 times. During the first 11 iterations, since x is not greater than 50, it will increment by 5 each time. After the 11th iteration, the value of x will be 55, which is greater than 50. For the remaining 89 iterations, x will increment by 1 each time. The final value of x after 100 iterations will be 55 + 89 = 144. During these iterations, the value of x never becomes 90, so the print statement is never executed. When x = 900, the code will execute as follows:

The initial value of x is 900. The for loop iterates 100 times. Since x is already greater than 50, it will increment by 1 each time. The final value of x after 100 iterations will be 1000. During these iterations, the value of x never becomes 90, so the print statement is never executed. In both cases, the output of the code will be empty, as the print statement is never executed.

--------

That is not correct: when x = 0 it will print "X is not 90". I basically created some simple code that is very unlikely to exist in its training data, because it doesn't do anything useful, and the model cannot execute the code. A large part of the reason that these models seem so impressive is that they were trained on almost everything that exists on the internet, so probably any question you ask has already been asked somewhere, and any useful code you present to it or ask it to generate is included in its training data, or at least something close enough.

So, when I go out of my way to create simple tasks or questions that are highly unlikely to exist in its training data, yet a human could easily handle, it often fails.
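For anyone who wants to check the claim rather than trust either transcript, here is a minimal sketch that replays the loop from the question and counts how often the `print` branch is hit. The function name `run_snippet` is made up for illustration; the loop body is the one posted above.

```python
def run_snippet(x: int) -> tuple[int, int]:
    """Replay the loop from the question; return (times_printed, final_x)."""
    printed = 0
    for _ in range(100):
        if x > 50:
            x += 1
        else:
            x += 5
        if x == 90:
            printed += 1  # stands in for print("X is not 90")
    return printed, x


for start in (0, 900):
    printed, final_x = run_snippet(start)
    print(f"x = {start}: printed {printed} time(s), final x = {final_x}")
```

Starting from 0 the message is printed exactly once and x ends at 144; starting from 900 it is never printed and x ends at 1000.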

I would guess you were trying an older version or you somehow confused it. I posted your code with only "Consider:" above it, and GPT-4 produced the correct answer on the first attempt:

    The given Python code initializes a variable x with the value 0, then iterates over a range of 100 values. Inside the loop, there are two conditional statements:

    The first conditional statement checks if x > 50. If true, it increments x by 1; otherwise, it increments x by 5.
    The second conditional statement checks if x is equal to 90. If true, it prints "X is not 90"; otherwise, it does nothing (pass).
    To better understand the code, let's walk through a few iterations:

    i = 0, x = 0: x is not greater than 50, so x += 5 => x = 5
    i = 1, x = 5: x is not greater than 50, so x += 5 => x = 10
    i = 2, x = 10: x is not greater than 50, so x += 5 => x = 15
    ...
    The loop continues until x > 50. This will happen when i = 10, and x = 50:

    i = 10, x = 50: x is not greater than 50, so x += 5 => x = 55
    Now x is greater than 50, so it will be incremented by 1 in each iteration:

    i = 11, x = 55: x is greater than 50, so x += 1 => x = 56
    i = 12, x = 56: x is greater than 50, so x += 1 => x = 57
    ...
    This will continue until x reaches 90:

    i = 34, x = 89: x is greater than 50, so x += 1 => x = 90
    Now x is equal to 90, so the program will print "X is not 90". The loop will continue to increment x by 1 in the remaining iterations:

    i = 35, x = 90: x is greater than 50, so x += 1 => x = 91
    ...
    The code snippet will only print "X is not 90" once, when x becomes 90 in the loop.
GPT-4 (as well as all GPTs before it) has a limitation: it has to produce an output in a single pass. It cannot pause and think, it cannot backtrack. So yes, it makes weird mistakes sometimes.

A human programmer will need to look at code, then think a bit, then look at it again, etc.

You can put programmers into a similar situation: try reading code aloud instead of showing it to them on screen. If they can't answer right, does it mean they aren't intelligent? Intelligence =/= never making a mistake.

Now that you know GPT's limitations, perhaps you would consider asking questions one at a time instead of intentionally trying to confuse it. Considering multiple questions in a single pass increases error rate.

> so probably any question you ask it has already been asked somewhere and any useful code you present to it or ask it to generate is included in its training data

The "it just recites training data" idea is demonstrably false. Do a bit of combinatorics. Or write a unique piece of code which is not a gotcha question and try it.
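A rough sketch of the combinatorics, under loudly stated assumptions: the vocabulary size is the GPT-2/GPT-3 BPE vocabulary (GPT-4's tokenizer is different), and the training-set size is a deliberately generous guess, not a published figure. The function name is made up for illustration. It also ignores that most token sequences are gibberish, so it only bounds how much of the space could have been seen verbatim.

```python
import math

VOCAB_SIZE = 50_257       # GPT-2/GPT-3 BPE vocabulary; an assumption for GPT-4
TRAINING_TOKENS = 10**13  # assumed, deliberately generous guess
SEQ_LEN = 20              # length of the token window we ask about


def log10_fraction_seen(vocab_size: int, seq_len: int, training_tokens: int) -> float:
    """Upper bound (as log10) on the fraction of all seq_len-token
    sequences that could appear verbatim in a corpus of that size."""
    possible = vocab_size ** seq_len
    # A corpus of N tokens contains at most N distinct windows of any length.
    return math.log10(training_tokens) - math.log10(possible)


bound = log10_fraction_seen(VOCAB_SIZE, SEQ_LEN, TRAINING_TOKENS)
print(f"at most 10^{bound:.0f} of all {SEQ_LEN}-token sequences could have been seen")
```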

GPT is a transformer model. Transformers use the attention mechanism. The mechanism is entirely concerned with retaining semantic context and semantic "global dependencies" spanning the entire input and output.

https://ar5iv.labs.arxiv.org/html/1706.03762

"Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences ...

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output."

Beyond that, also note that LLMs are probabilistic machines. Output spat out can vary and there are a handful of knobs (such as temperature) to modulate that output.
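To make the temperature knob concrete, here is a minimal sketch of temperature-scaled sampling over a handful of made-up token scores. The logits and function names are illustrative assumptions, not anything taken from a real model or API.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At a low temperature nearly all the probability lands on the top-scoring token, so repeated runs look deterministic; at a high temperature the distribution flattens and the same prompt can produce noticeably different outputs.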

Finally, I'm pretty sure we (or the workers in the field more like it /g) don't have a firm grasp on why certain failure modes occur. Likely this is due to the fact that we (they) also don't really have a good grasp on how the damn thing actually works its 'magic'.

What is clear is that a significant subset of our semantic universe is embedded in symbols and their usage by us and this subset is somehow encoded in neural nets. This captured subset in LLMs is what drives their uncanny generative abilities. What is missing is precisely what would make it plausibly intelligent, plausibly a reasoning agent operating in a coherent semantic context.

There are some who claim our minds are just like LLMs. Some of us who pay attention to our minds sometimes catch it making nonsensical noises and correct it. (As you age you begin to notice these things..) So it is interesting to this sentient (who makes claims to being) that my mind is just like my body, it is aging, certain parts are degraded, etc., but my 'whateveritis' that is me, my self, is as timeless as ever, and seems to be a spectator of the aging mechanism ..

> The mechanism is entirely concerned with retaining semantic context and semantic "global dependencies" spanning the entire input and output.

This is not quite true: GPT, specifically, is auto-regressive. It computes things only looking back, not forward.

Given that each token has only a fixed computing budget, it is likely that GPT precomputes information which will be relevant to later tokens, to be routed via attention.

In fact, this effect was demonstrated in practice: e.g. in a prompt like "Question: Where is the Eiffel tower located? Answer: " people found that information about "Paris" is routed from tokens "Eiffel tower", i.e. this associative memory was looked up earlier than it was needed.

So I was answering from that perspective: it can do better if it knows what to pre-compute.
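The "only looking back" property can be shown with a toy version of causal (masked) scaled dot-product attention. This is a sketch in plain Python with made-up vectors: real GPT models add learned projections, many heads, and many layers, none of which appear here.

```python
import math


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions <= i."""
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Score the query against the current and earlier positions only;
        # skipping j > i is the causal mask.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the visible value vectors.
        outputs.append([
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = tokens[:2] + [[5.0, -3.0]]  # alter only the last position
before = causal_attention(tokens, tokens, tokens)
after = causal_attention(changed, changed, changed)
print(before[:2] == after[:2])  # earlier positions are unaffected by a later token
```

Because earlier positions never see later ones, anything a later token needs has to be computed at the earlier positions already, which is the pre-computation point made above.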

Like I said in the parent, I use this tool heavily every day for coding and non-coding work. The above was meant to be a tiny illustration. I have had long back-and-forths with GPT-4 trying to get it to understand some nontrivial useful code, or to generate some useful code, and no matter how I phrased it, it got it completely wrong even though the result looked superficially plausible.
We are talking about a computer program that cannot make decisions. Oftentimes human intelligence is judged by the types of decisions the person makes.
It can make decisions. E.g. if you give it a task, it can formulate sub-tasks, then execute them, e.g. by selecting which API calls to make.

Obviously, if you call the model, you define the scope for it, and it can only make decisions within that scope.

The scope is potentially unbounded. E.g. you can give it access to a terminal of a Linux machine which has an internet connection and tell it to do whatever it likes. (Given RLHF perhaps it would be better to give some positive goal, e.g. perform actions which would eventually produce knowledge.)
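A minimal sketch of what "decisions within a scope" could look like in code. Everything here is hypothetical: the tool names are invented, and the `plan` list stands in for whatever sequence of calls a model would choose for a task. The point is only that the caller's tool table defines the scope.

```python
from typing import Callable


def run_in_scope(
    plan: list[tuple[str, str]],
    tools: dict[str, Callable[[str], str]],
) -> list[str]:
    """Execute model-chosen (tool, argument) steps, refusing tools outside the scope."""
    results = []
    for tool_name, argument in plan:
        tool = tools.get(tool_name)
        if tool is None:
            results.append(f"refused: {tool_name} is out of scope")
        else:
            results.append(tool(argument))
    return results


# Hypothetical tools; in a real system these would wrap actual API calls.
tools = {
    "upper": lambda text: text.upper(),
    "reverse": lambda text: text[::-1],
}
# Stand-in for a plan the model would produce from a task description.
plan = [("upper", "hello"), ("shell", "ls"), ("reverse", "abc")]
print(run_in_scope(plan, tools))
```

Widening the scope is then just a matter of adding entries to `tools`, up to and including something like a terminal.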

Out of curiosity, what is GPT-4 getting wrong so often? It's pretty wild to my own, admittedly easily impressed, mind.
I've been trying to make it be a DM for a while now. It works seemingly well for a while but it's making everything up on the fly, which is fine as long as there's no state that needs to be maintained or remembered. Once you start trying to get it to do combat, it seems like it's working pretty well, but eventually you realize that it's just making shit up behind the scenes. It'll tell you there are 4 goblins ambushing you, and then you can ask it how far away they all are and it'll list three. You ask what happened to the fourth one and it'll say "I'm sorry, there were only 3. I was mistaken before."

If it lists the properties of an item that you might want to buy in a shop, you can ask it to describe it twice and it'll describe two completely different items.

It's really cool and it's pretty (seemingly) creative but it can't actually run a game for you. You can have it as an assistant DM though, that works pretty well. You can have it write a story for you ahead of time and then keep it around during the game to ask it to elaborate on things you didn't anticipate on the fly. Like, "generate DC tiers for a level 3 party investigating strange writing on a wall" will give you a good breakdown and some results that you'll have to bend to be consistent in your adventure but it's pretty helpful.

That's probably not really necessary if you're an experienced DM, but the DM for the group I play with is pretty new (as are we all), so it's been really cool to have it around. It's also pretty good at answering questions we have, but its confidence when it's wrong means it's not that helpful really, because we still have to check it.

You should put a blog/site up with some examples.

"GPT being the worst DM ever" sounds hilarious.

"I attack the goblin." "What goblin?" "The goblin you just said was there." "I'm sorry, I was mistaken. It's actually a Beholder."

I think this could make for a hilarious animated series. Kind of an AI-generated mashup between HarmonQuest[1] and DrunkHistory[2]

[1] https://en.wikipedia.org/wiki/HarmonQuest [2] https://en.wikipedia.org/wiki/Drunk_History

So you’re complaining it’s not smart in a way it was explicitly designed not to be (keep too long a context) so it doesn’t take over the world..
No. It's much worse than that. It doesn't remember what it just said the message before.

The thing I don't understand about it is that it works pretty great for a while but, eventually, it starts acting erratically, forgetting things it knew, not following instructions, etc. It's not that it forgets old things or can't learn new things, it just becomes dumb.

Maybe that's what you're talking about but I don't think AGI is going to have the memory of a goldfish.

I wonder if you'd get better results with a narrative-tuned tool like Sudowrite - although Sudowrite is currently limited to GPT3.5 and below.
Have you noticed any improvements with GPT-4 regarding the continuity and persistence of story and assets? In unrelated areas that feel similar I've noticed GPT-4 keeping track a lot better.
Context windows on GPT are still really small for the number of tokens that would be generated in a story.
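A sketch of why a small window hurts long-running stories, assuming the client simply keeps the most recent messages that fit a token budget. That truncation strategy is an assumption for illustration (how ChatGPT actually manages history is not stated in this thread), and word counts stand in for a real tokenizer.

```python
def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = len(message.split())  # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "DM: Four goblins ambush you from the treeline.",
    "Player: I draw my sword and ask how far away they are.",
    "DM: They are thirty feet away and closing fast.",
    "Player: I attack the nearest goblin.",
]
print(fit_to_window(history, max_tokens=20))
```

With a budget of 20 "tokens" only the last two messages survive, so the line that established there were four goblins is no longer visible to the model at all.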
In my experience it has been failing to adhere to response formats. I can tell it to respond with answers in one of two specific formats, but the success rate is sometimes very low depending on the content of the inquiry it's supposed to respond to. Not sure why, it just seems to lose track of the response format.

I've been thinking maybe that's just the wrong way to approach it, yet it does work sometimes; it seems like it shouldn't be hard for the model to know how to respond when told explicitly how to do so.

It could be problems with my formatting, but the fact that it does work a lot of the time suggests it's something else.
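One way to put a number on "the success rate is sometimes very low" is to validate every reply against the allowed formats and count. This is a minimal sketch: the two formats below are hypothetical placeholders, since the actual formats were not given, and the sample replies are invented.

```python
import re

# Two hypothetical reply formats; the real ones were not specified.
FORMAT_A = re.compile(r"^ANSWER: (yes|no)$")
FORMAT_B = re.compile(r"^SCORE: (\d{1,3})$")


def matches_format(reply: str) -> bool:
    """True if the reply is exactly one of the two allowed formats."""
    text = reply.strip()
    return bool(FORMAT_A.match(text) or FORMAT_B.match(text))


def success_rate(replies: list[str]) -> float:
    """Fraction of replies that adhere to an allowed format."""
    if not replies:
        return 0.0
    return sum(matches_format(r) for r in replies) / len(replies)


replies = ["ANSWER: yes", "SCORE: 42", "Sure! The answer is yes.", "ANSWER: maybe"]
print(success_rate(replies))  # 0.5
```

Logging the rate per category of inquiry would show whether the failures really track the content of the question rather than the prompt formatting.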

Any variant of a "surprising" logic puzzle forces it to latch onto the surprising answer. Like whether two pounds of iron weighs more than one pound of feathers. Or any objects. It "expects" the twist, and always answers accordingly. It does so even if you change up the objects to be less tricky.

> Which is heavier, a pound of marbles or two pounds of corn?

Both weigh the same amount, which is a total of two pounds. This might seem counterintuitive at first, but it's important to remember that the pound is a unit of weight or mass, and it always refers to the same amount regardless of what is being weighed. In this case, one pound of marbles plus two pounds of corn equals a total of three pounds, whereas two pounds of corn plus one pound of marbles also equals a total of three pounds. Therefore, both weigh the same amount of two pounds.

GPT-4 gave the correct answer when I tried just now:

> Two pounds of corn is heavier.

I’m getting

> Two pounds of corn is heavier than a pound of marbles. A pound is a unit of weight, and two pounds is twice as heavy as one pound, regardless of the material being weighed.

I am terribly sorry, but I fail to see the logic in your 2+1=1+2 explanation of that answer. Would you be kind and ELI5 it, if possible?
afaict the 2+1=1+2 was the response from the LLM, which is why they are disappointed with the answer
GPT is really good at repeating what the average intelligent response to something might look like, but it doesn't seem to be actually reasoning about any of its responses. Give it a complex logical problem that it needs to deduce from inputs, such as which foods contain gluten based on their ingredient lists, and it will reliably fail. As a person with celiac, this is a task I complete multiple times a day with no effort. Just today I was trying to build a prompt that would summarize daily news updates, leaving out anything about Russia, but it still included Russia more often than not, despite the prompt being very clear that anything about Russia should not be included in the response under any circumstances.
No, I disagree with this. The average intelligent response to many things is simply "I don't know", in contrast to what LLMs do in that instance, which is fabricate a wrong answer.
I know one otherwise fairly intelligent guy that I believe is also incapable of saying “I don’t know.”
LLMs without any online storage can be at best convincing liars. Combining them together with an actual retrieval/QA system (e.g. by first fetching exact answer via a QA model and then reformulating output via GPT) could start feeling pretty real quickly.
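A toy sketch of that two-stage idea: look the fact up first, and only then hand it to a rewording step. Both stages are stand-ins and every name is hypothetical: a dictionary plays the retrieval/QA system, and a string template plays the GPT reformulation call. The Eiffel tower question is borrowed from elsewhere in this thread.

```python
from typing import Optional

# Toy knowledge base standing in for a real retrieval/QA system.
FACTS = {
    "where is the eiffel tower located": "Paris",
}


def retrieve_answer(question: str) -> Optional[str]:
    """Return the stored fact for a question, or None if nothing is known."""
    key = question.lower().strip(" ?")
    return FACTS.get(key)


def reformulate(question: str, answer: str) -> str:
    """Placeholder for the LLM step that would reword the retrieved answer."""
    return f"{answer} (looked up for: {question})"


def answer_question(question: str) -> str:
    fact = retrieve_answer(question)
    if fact is None:
        # With no retrieved fact there is nothing to reword, so say so.
        return "I don't know."
    return reformulate(question, fact)


print(answer_question("Where is the Eiffel tower located?"))
print(answer_question("Who discovered the fourth goblin?"))
```

The design choice worth noting is that "I don't know" falls out of the retrieval step, so the generative step is never asked to produce a fact on its own.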
It depends on 1) the domains 2) your comparison group.

On 2), many software engineers and computer scientists compare these language models' logic and creative problem solving abilities with themselves and their peer group. But they are usually 1-2+ SD above average humans at these things.

(Note: Someone gave GPT-4 an IQ test and the result was 96, slightly below the reference human group's average of 100. The SD of an IQ test is 15 or 16.)
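To put those numbers on a common scale, here is a small sketch that converts a score to a percentile, assuming scores are normally distributed with mean 100. That is the usual convention, but it is an assumption here, and the helper name is made up.

```python
from statistics import NormalDist


def iq_percentile(score: float, sd: float = 15.0, mean: float = 100.0) -> float:
    """Percentile of a score under a normal distribution with the given mean and SD."""
    return NormalDist(mu=mean, sigma=sd).cdf(score) * 100


for sd in (15, 16):
    print(f"IQ 96 with SD {sd}: {iq_percentile(96, sd):.1f}th percentile")
print(f"+1 SD: {iq_percentile(115):.1f}, +2 SD: {iq_percentile(130):.1f}")
```

Under that assumption a score of 96 sits at roughly the 40th percentile with either SD, while a peer group 1 to 2 SD above the mean sits at roughly the 84th to 98th percentile, which is the gap in comparison groups described above.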

For language-focused domains, there is evidence that GPT-4 is already better than most humans: e.g., it scores in the 99th percentile on GRE Verbal, and it beat humans at a fairly novel puzzle, Twofer Goofer, which is not in its training set.

Ref: GPT-4 Beats Humans at Hard Rhyme-based Riddles https://twofergoofer.com/blog/gpt-4

Yes, GPT-4 is not an AGI yet, but the research paper (OP) has a point.

> Yes, GPT-4 is not an AGI yet, but the research paper (OP) has a point.

How did you go from "human-level IQ with some super-human abilities" to "not an AGI"?

It is lacking in some aspects of intelligence. Its abilities are, in human view, less evenly distributed.

The average human-level IQ, which is not certain but seems likely, comes from superior abilities in some domains but is pulled back by others.

Limited context windows and inability to turn short term memory into long term model weights are the biggest ones that would keep it from being a 'human like' AGI.

Really at this point it is about how poorly defined the term is.

Those rhyme riddles are pretty impressive. It may not truly understand rhymes due to BPEs, but I guess it can go a long way with an immense vocab, perfect recall, and memorization of similar-sounding words to beat ordinary human players who aren't scoring 800 SAT-Vs...
It’s almost as if someone who can piece together words well enough can make people think they are smarter than they actually are.
It's worked for me these last 60 years. I think P T Barnum had something to say about that.
No one cares about your anecdotes and feelings. These people conducted thousands of systematic tests for at least six months. If you want to be taken seriously then do the same.