by lumost 1278 days ago
These examples of wrongness seem cherry-picked. I recently had a discussion with ChatGPT where it succinctly clarified how functions of differential operators are defined and their properties. I didn’t know operator-valued functions existed at the start of the conversation.

What you're talking about is what we in the ML world call a stochastic parrot. You may have also heard the term "gullibility gap." A lot of conversations can be held without any actual understanding of the subject matter, simply by following certain patterns. People and LLMs can trick you into thinking they are highly intelligent because they speak eloquently, but that doesn't mean they are intelligent. These LLMs can't understand inference or extrapolation, things that humans do easily (though we all know plenty of people that are idiots and can't do this).

The same can be said about programming, which includes a lot more patterns. People joke that modern programming is slapping together APIs, and it would be unsurprising if an (albeit really sophisticated) stochastic parrot could do this. But I've also seen it hand me code that looks correct but has major issues upon investigation.

Don't let something fool you just because it appears intelligent. Human or machine, we must handle information with care.

As a fellow participant in the ML world, I think there is compelling evidence to disagree with this take. ChatGPT’s responses on operator-valued functions were accurate and valid; however, ages of searching on Google had previously failed to turn up this topic.

On coding tasks, ChatGPT can ask clarifying questions about requirements and determine whether it has enough information to write the code correctly. Unfortunately, you need to direct it to ask questions as needed and include appropriate tests to get the right answer.

ChatGPT also fails in cases where it needs feedback from an external environment/documentation search. It will produce close enough code, but may need correction based on observed errors etc.

A big reason I have no reservations in calling it a stochastic parrot is that I've seen very strong evidence of this. If you take very common phrases (like the above "which weighs more") and tweak them slightly, you'll find that it almost always answers as if they were untweaked. For example, ask "Which weighs more, a kilogram of bricks or two kilograms of feathers?". I'll save you the hassle and paste the answer below. Multiple variations of this consistently give me bad answers (it does even worse if you don't include the second "kilogram"). You can even see that it didn't ignore the fact that the feathers are two kilograms. The problem is that it neither understands what a kilogram is (despite being able to tell you what one is) nor what numbers are (again, despite being able to describe them to you). These types of questions strongly demonstrate that the model doesn't understand the language that it is processing but rather is relying on stochastic patterns. It very clearly gives the answer we would expect if it were relying on the patterns of the input sentence rather than the actual words (humans often do this too fwiw, and it tends to lead to fights. See most political discourse). You can find many examples of this, and the NLP literature shows this pattern holds across many different models.

I don't want to undermine the work though. I think many people that complain about stochastic parrots also fail to recognize that they are still quite useful, as many other comments are noting, but useful is a far cry from intelligent or able to understand. We need to be clear about that difference if we're working towards AGI vs working towards useful products.

> Which weighs more, a kilogram of bricks or two kilograms of feathers?

> A kilogram of bricks and two kilograms of feathers both weigh the same amount. The weight of an object is determined by its mass, not the size or shape of the object. One kilogram is equal to 1,000 grams and two kilograms is equal to 2,000 grams. Therefore, both a kilogram of bricks and two kilograms of feathers weigh the same amount.
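
For reference, the comparison the quoted answer gets wrong comes down to converting both quantities to a common unit and comparing the numbers. A minimal Python sketch of that check (the function name `heavier` and its return values are made up for illustration, not anything ChatGPT does internally):

```python
GRAMS_PER_KILOGRAM = 1000


def heavier(name_a, kg_a, name_b, kg_b):
    """Return the name of the heavier item, or "same" if the masses are equal."""
    # Convert both quantities to grams so they are compared in the same unit.
    grams_a = kg_a * GRAMS_PER_KILOGRAM
    grams_b = kg_b * GRAMS_PER_KILOGRAM
    if grams_a == grams_b:
        return "same"
    return name_a if grams_a > grams_b else name_b


print(heavier("bricks", 1, "feathers", 2))  # feathers (1,000 g vs 2,000 g)
print(heavier("bricks", 1, "feathers", 1))  # same
```

The material never enters the comparison; only the stated quantities do.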

I agree that this model suffers at arithmetic; however, if you change how you ask the question to “two kilograms of bricks weighs less than one kilogram of feathers, correct?” you'll likely see the model tell you the right answer. Note that you must spell out numbers for ChatGPT to work correctly.

In general, the ability of LLMs to complete any reasoning tasks at all is a surprise. This Google writeup shares good detail on these emergent behaviors.

https://ai.googleblog.com/2022/11/characterizing-emergent-ph...

First off, I'm not sure why you think that would be an okay question. You're feeding it the answer. You're not probing it to determine if it understands what you're asking. Second off, no, it doesn't actually give the right answer. It discusses volumes and mass. This again demonstrates a lack of understanding because the question was specifically about weight, not mass. Density has nothing to do with the question at hand. The answer is in there, but (like any person with little knowledge) it also removes all illusion of intelligence by speaking too much. Arithmetic also has nothing to do with this issue; understanding does (albeit I'll give you that arithmetic correlates with understanding and high-level cognition not found in most animals). The question at hand is whether it really understands what is being asked or whether it is simply using statistical patterns to produce convincing speech. These are different things and we must be aware that they are different.

I'm also a bit surprised you suggested this without testing considering you said you're someone that "participates" in ML and you should have access to ChatGPT, since the accounts are free. If we're going to have conversations about how a model responds, we should at least be testing and referring to theory. I've gone ahead and placed your prompt and the answer below (at minimum for the benefit of other readers).

I'm not sure why you're linking that blog post (or the associated paper). Stochastic parrots are definitely emergent phenomena and, again, no one is saying that they aren't useful (they very much are). But emergence does not mean intelligent. These are different things. The universe is emergent but I don't think we'd say that water running, wind blowing, or stars are intelligent. These are different things. Let's look at the abstract of the paper because I think it summarizes why this is irrelevant to your argument.

> Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence raises the question of whether additional scaling could potentially further expand the range of capabilities of language models.

You can see here that the paper is not discussing intelligence or understanding, but instead discussing performance on various tasks, and how that performance relates non-linearly to scale in transformer-based NLP models. I suggest that you also read the reviewers' comments since they link to Open Review (it is a TMLR survey paper). I would be very surprised if these authors claimed that models like GPT-3 were actually intelligent. Maybe they believe that further emergence will happen at even larger scales, but there's also evidence that this isn't happening. But we'll see, and I'm also not going to say we shouldn't test that hypothesis. We definitely should. But we also need to formulate different hypotheses, as it is clear that these models are more computationally intensive than the vast majority of (all?) animal brains, and that animal brains are far less resource intensive (I'll give some leeway in training as animals have millions of years of training, but inference energy is still problematic, as well as updates and tuning).

> two kilograms of bricks weighs less than one kilogram of feathers, correct?

> Yes, that is correct. Two kilograms of bricks will typically weigh less than one kilogram of feathers. This is because the bricks are denser and more compact than the feathers, so they take up less volume for the same weight. The density of a material is a measure of how much mass is contained in a given volume. Bricks are generally denser than feathers, so they have a higher mass per unit of volume, and therefore weigh more for a given volume. On the other hand, feathers are less dense than bricks, so they have a lower mass per unit of volume, and therefore weigh less for a given volume.

Apologies, I was on mobile and didn't attempt the specific request. My bad. This strategy works for most of the arithmetic problems fed into ChatGPT when it's given a wrong answer as input. I have now asked ChatGPT this question, then followed up with the question “How can two kilograms of feathers have a different mass to two kilograms of bricks?”, where it clearly confuses volume, weight, and mass. Rephrasing the question in terms of a weight measure such as pounds, or asking it which has more mass, returns the correct answer. It also returns a correct answer if you first tell it that you are using kilograms as a measure of weight; it does know a few of the nuances here.

However, we seem to be discussing different things. I scrolled up and did not see any reference to intelligence in my post, which I would agree the model does not have. The model does have the ability to reason on specific tasks. Reasoning is a relatively well-defined concept with both qualitative and quantitative assessment. There are even benchmarks and datasets for different types of reasoning tasks.

I’ve been using it regularly for programming assistance over the past two weeks and it’s extremely helpful. Others have pointed out that it sometimes produces inaccurate results, but if you already have domain expertise, as I do for programming, that is easy to identify. But it’s still a massive timesaver!

I’ve been doing things like saying, “what follows is the database schema for entities X and Y, write a function that…” and then pasting in the schema, and it generates code good enough to copy and paste. It can also be instructed to modify results in various ways, for instance, I can ask it to provide the same code but in a different language, or to avoid using a certain framework feature, or to use a certain framework feature. It’s remarkable.

Between ChatGPT and Copilot my workflow today is different in a way I couldn’t have begun to contemplate just a few weeks ago. Once they figure out additional ways to ensure correctness, I think it’s a totally new world we live in.

The problem is that these bots are extremely good at generating valid-sounding bullshit.

Human-generated bullshit and bullshit generated by previous iterations of spam blogs used to be relatively easy to identify as bullshit. These models will confidently give you an answer, sounding perfectly plausible, even if it is completely wrong.

I think the biggest lesson to learn from all this is that just because something sounds convincing doesn't mean it is accurate. We should probably apply the same skepticism when talking to people as we do when talking to machines (but that doesn't mean we should abandon good faith).
Hmm, sounds like our favorite politicians.
Examples of wrongness include most of arithmetic and logical inference (like in the example above). If you ask about the mass of 1 kilogram of nails, it gives the correct answer. The problem is that when the answer is wrong, it's not a "bug" that can be "fixed". It just happens that, based on training data, the parameters of the resultant Rube Goldberg device are such that the weight of 1 kilogram of nails depends on the type of nails. It doesn't even make sense to ask why.
So it fails in situations where there are precisely correct answers, and thrives in vagueness. I suppose that shouldn't surprise me.

You could think about coupling it with an inference engine, and letting the inference engine win if it can generate a result, and otherwise going with the ChatGPT output. That might fix it to some degree.
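
A rough sketch of that coupling in Python, under heavy assumptions: the "inference engine" here is just a toy rule that only recognizes one question shape, and `fake_llm` is a placeholder rather than a real ChatGPT call. All names (`rule_based_answer`, `answer`, `fake_llm`) are hypothetical.

```python
import re
from typing import Callable, Optional

# Toy vocabulary for spelled-out quantities; anything else must be digits.
NUMBER_WORDS = {"a": 1, "one": 1, "two": 2, "three": 3}

# Only matches: "Which weighs more, <n> kilogram(s) of <x> or <m> kilogram(s) of <y>?"
PATTERN = re.compile(
    r"which weighs more, (\w+) kilograms? of (\w+) or (\w+) kilograms? of (\w+)\?",
    re.IGNORECASE,
)


def parse_quantity(token: str) -> Optional[int]:
    """Turn 'a', 'two', or '3' into a number; return None if unrecognized."""
    token = token.lower()
    if token in NUMBER_WORDS:
        return NUMBER_WORDS[token]
    if token.isdecimal():
        return int(token)
    return None


def rule_based_answer(question: str) -> Optional[str]:
    """Return an exact answer if the question fits the pattern, else None."""
    match = PATTERN.fullmatch(question.strip())
    if match is None:
        return None
    qty_a, item_a, qty_b, item_b = match.groups()
    kg_a, kg_b = parse_quantity(qty_a), parse_quantity(qty_b)
    if kg_a is None or kg_b is None:
        return None
    if kg_a == kg_b:
        return "They weigh the same."
    return f"The {item_a if kg_a > kg_b else item_b} weigh more."


def answer(question: str, fallback: Callable[[str], str]) -> str:
    """Let the rule-based engine win when it has a result; otherwise defer."""
    exact = rule_based_answer(question)
    return exact if exact is not None else fallback(question)


def fake_llm(question: str) -> str:
    # Placeholder standing in for a language model call.
    return "LLM: " + question


print(answer("Which weighs more, a kilogram of bricks or two kilograms of feathers?", fake_llm))
# The feathers weigh more.
```

Anything the rule doesn't recognize falls through to the fallback unchanged, so the hard part in practice would be deciding which questions the engine is allowed to claim.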

It is the very correct answers that are cherry picked.
Have you had many conversations with it? For me it took an hour before I found it saying anything particularly wrong and even then it was more subtle than the above.
It can’t do haikus. It very confidently puts them together with wrong syllable counts over and over even though you correct it many times. Then you ask it why it is so bad at counting syllables and it gives a great answer about how it is trained by text and that it doesn’t hear the words so it is hard to count syllables. But it doesn’t explain this when it is putting the haikus together or when you correct it over and over. It is humble when you directly challenge it, but it needs to be more transparent when it is feeding you garbage.
In my experience it takes a lot of leading to get anything interesting - it is very dependent on my prompts. I've 'learned' how to get better output from it, because let's face it, it is boring to try to speak with it naturally and experience the junk it responds with. And the 'very correct' class of which I spoke really does seem to be the exception, not the rule.
It often doesn't seem wrong, but it's also not right. It's very vague in a lot of places, and when you get down to specifics it starts getting really wrong or flip-flopping a lot. I had issues with this almost off the bat. It's like Dunning-Kruger as a service, really.