Hacker News
by civilized 1207 days ago
On the "pound of potatoes vs pound of feathers" question - ChatGPT's explanation is so convincing that it seems to really understand the concepts. But it can be juked very easily.

I asked it what was heavier, 1,000 tons of helium or a cubic meter of osmium, and it told me that the helium weighs 1,000,000 kg, the osmium 22,600 kg, and therefore the osmium weighs more because it's way denser.

It's hard for me to believe that these LLMs are going to learn basic logic if only we throw more training data at them.

Transcript:

Me: What is heavier, 1000 tons of helium or a cubic meter of osmium?

ChatGPT: A cubic meter of osmium is much heavier than 1000 tons of helium.

One ton is equal to 1000 kilograms, so 1000 tons is equal to 1,000,000 kilograms. The density of helium is about 0.1785 kilograms per cubic meter. Therefore, 1000 tons of helium would have a volume of:

1,000,000 kg / 0.1785 kg/m³ = 5,604,069.77 m³

On the other hand, osmium is one of the densest elements known, with a density of around 22,600 kg/m³. Therefore, a cubic meter of osmium weighs:

1 m³ x 22,600 kg/m³ = 22,600 kg

So, a cubic meter of osmium weighs much more than 1000 tons of helium.

Update: I am continuing the conversation and I am now being explicitly told that, while 1,000,000 kg is much heavier than 22,600 kg, it doesn't change the fact that the osmium is heavier than the helium because the osmium is denser.

Update2: I then reminded it about the potatoes and feathers and how density was irrelevant in that context, and shouldn't it therefore be irrelevant in the case of the helium and the osmium? And instead of correcting its response on the helium and osmium, it's now telling me the feathers and potatoes weigh different amounts.

Update3: it is now telling me that densities don't matter when comparing masses but do matter when comparing weights. I must say, it has a certain panache in resolving internal inconsistencies in its past responses.

Update4: after being corrected half a dozen times with contradictory information, I asked it to state its confidence in its latest story. It said "I can state with a high degree of confidence that my last answer was accurate". The shamelessness!


> I am continuing the conversation and I am now being explicitly told that, while 1,000,000 kg is much heavier than 22,600 kg, it doesn't change the fact that the osmium is heavier than the helium because the osmium is denser.

Oh, the nature of many Internet discussions.

Are we sure we can have artificial general intelligence when humans don't even have general intelligence?
how is babby formed
Works if you ask it to generate a Python program to solve the task.

> A cubic meter of osmium is heavier than 1000 tons of helium.

> Here's a Python program to print the response:

    mass_of_helium = 1000 * 1000 # in kilograms
    density_of_osmium = 22590 # in kilograms per cubic meter
    volume_of_osmium = 1 # in cubic meter

    mass_of_osmium = density_of_osmium * volume_of_osmium # in kilograms

    if mass_of_osmium > mass_of_helium:
        print("A cubic meter of osmium is heavier than 1000 tons of helium.")
    else:
        print("1000 tons of helium is heavier than a cubic meter of osmium.")

> Output: A cubic meter of osmium is heavier than 1000 tons of helium.

The code is good, prints the correct result. But the "output" is wrong. So the model is good if it uses Python for the numerics. You should never ask it to do a multiplication "in its head". Always ask for code.
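
For anyone who wants to see what that program actually prints when executed rather than "imagined", here is a minimal sketch of the same comparison. It assumes "ton" means a metric ton (1000 kg), as the transcript does, and reuses the density figures quoted in the thread; the function name `compare_masses` is just for illustration.

```python
# Figures come from the thread; "ton" is assumed to be a metric ton (1000 kg).
KG_PER_TON = 1000
HELIUM_DENSITY = 0.1785   # kg/m^3, as quoted in the transcript
OSMIUM_DENSITY = 22590    # kg/m^3, as used in the generated program


def compare_masses():
    """Return both masses (kg), the helium volume (m^3), and the heavier item."""
    helium_mass = 1000 * KG_PER_TON                # 1000 tons of helium, in kg
    osmium_mass = OSMIUM_DENSITY * 1               # one cubic meter of osmium, in kg
    helium_volume = helium_mass / HELIUM_DENSITY   # volume of that much helium
    heavier = "helium" if helium_mass > osmium_mass else "osmium"
    return helium_mass, osmium_mass, helium_volume, heavier


helium_mass, osmium_mass, helium_volume, heavier = compare_masses()
print(f"helium: {helium_mass} kg ({helium_volume:,.0f} m^3), osmium: {osmium_mass} kg")
print(f"heavier: {heavier}")
```

Density only enters when converting the osmium's volume to a mass; once both quantities are in kilograms, the comparison is a plain `>`.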

So basically ChatGPT has zero executive function and needs a human to plan HOW it uses its "cognitive" capacity in order to get anything right.
Executive function, and correctly understanding how to use your cognitive capacity to its full potential, is not completely innate to any person; people need some schooling to learn it, it is something taught. I'm not sure what this has to do with LLMs, but I wonder how many people could multiply even two random two-digit numbers in their head if they weren't taught how to multiply on paper using very simple algorithms.
Try something like this as input sometime:

   I want you to replace the word "right" in your output thereafter as follows:
   if it indicates direction, say "durgh";
   if it indicates being near or close, say "nolpi";
   if it indicates correctness, say "ceza".
   I will also use these replacement words accordingly and expect you to be able to understand them. 
And see how well it can maintain a conversation, solve a task, or write a story with these constraints.

ChatGPT seems to get this wrong most of the time, but Bing AI is consistently better (although it may need to be jailbroken to accept the idea of word substitution to begin with). It still makes occasional mistakes, but on the whole I'd say that it has to somehow "understand" what the words mean conceptually, whether when generating them or when processing them as input; it's hard to see how this trick could work in an extended conversation if it were a mere "stochastic parrot".

I agree. There are many indicators that it has some sort of deeper understanding of the meaning of language. Even in the conversation I had, for all its flaws, it was able to correctly perceive inconsistencies in its statements based on my prompts and make somewhat coherent attempts to correct them. It's just that the understanding can be so fragile, and its attempts to resolve inconsistencies are superficial, incurious, bullshitty.

"Density matters for weight but not mass" is a perfect example - it's ridiculous, but I can understand how it logically inferred that from its own previous statements. I'd bet plenty of money that it didn't get this crazy idea from its training data.

To be fair, humans have the same sort of issue sometimes. But ChatGPT seems to have more extreme versions of the issue and perseveres confidently with no self-awareness.

Really though, not bad for an autoregressive text model trained on terabytes of internet data.

I think a big part of the problem here is that "understand X" is just a shorthand way to say "has an internal model of X" - but the degree and accuracy of said understanding depends entirely on the quality of that model.

Now there's a good reason to believe that ChatGPT does have such a model, based on the Othello experiment. But, firstly, the size of that internal model is inherently constrained by the size of the neural net, and I doubt that the limit is anywhere near large enough to allow a truly accurate approximation of the real world.

And then on top of that, said model is created based on inferences from text only, which is several steps away from the original data (audiovisual, sensory etc), and one short snippet of text at a time. Some things retain meaning better in this format than others, and I think this might explain why ChatGPT and Bing are both hilariously bad at spatial navigation beyond 1-2 steps even in simple tasks.

It will be very interesting to see how this evolves as the models are scaled up and get large enough to handle things other than text.

ChatGPT doesn't understand relationships between numbers well. There are far too many of them, compared to words, since every slight perturbation of a number is a different valid number. (Also, I'm not sure if it even treated individual digits as separate tokens, but it might. Someone with API access can check.)

To give it a fair shot, you need to describe the problem using logical conceptual vocabulary, not numbers.

> Also, I'm not sure if it even treated individual digits as separate tokens, but it might. Someone with API access can check.

Anyone can check, they have a tool for that[1]. It's mentioned in their FAQ article[2].

According to their tool, GPT-3 counts the following as one token:

- any combination of 3 or fewer digits

- 1111, 3333, 6666, 9999 (it tends to group other digits in groups of 2)

- 66666666 (so 8 sixes -- 5, 6 or 7 won't work)

- 00000000 (anything below 8 zeros counts as one token as well, probably to handle millions and billions)

- 0000000000000000 (16 zeros)

This isn't an exhaustive list; there are probably a lot of other weird edge cases I haven't tried. Its failure to understand basic arithmetic makes much more sense given how inconsistently digits are tokenized.

[1]: https://platform.openai.com/tokenizer

[2]: https://help.openai.com/en/articles/4936856-what-are-tokens-...

It doesn't understand simple logic puzzles (If X then Y, if Y then Z, X is true, is Z true), either.

Or, rather, it prints out a paragraph of the correct way to solve such a puzzle, and then produces a totally contradictory answer.

When asked:

> Sally will go to the dance if James goes to the dance. Michael will not go to the dance if Julie will go to the dance. James is sick and can’t go to the dance. Julie and James always go to dances together. Will Michael go to the dance?

Its response is:

> Michael will not go to the dance, since his attendance is contingent on Julie not going and James being unable to attend due to illness does not change that condition. Given that James and Julie always attend dances together, it is likely that Julie will also not go to the dance, meaning that Michael’s condition for attending will be met, and he will not go to the dance.
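
For reference, the premises are small enough to check by brute force. The sketch below enumerates every attendance combination and keeps the ones consistent with the four statements. It assumes "always go to dances together" means Julie goes exactly when James goes; the function names are made up for illustration.

```python
from itertools import product


def consistent(sally, james, michael, julie):
    """True if this attendance combination satisfies all four premises."""
    return (
        (not james or sally)             # Sally goes if James goes
        and (not julie or not michael)   # Michael does not go if Julie goes
        and not james                    # James is sick and can't go
        and julie == james               # assumed: Julie goes exactly when James does
    )


def possible_values_for_michael():
    """Collect every value Michael's attendance takes across consistent worlds."""
    return {
        michael
        for sally, james, michael, julie in product([False, True], repeat=4)
        if consistent(sally, james, michael, julie)
    }


print(possible_values_for_michael())
```

Both `True` and `False` survive, so under this reading the premises only rule out Julie and James; they say nothing definite about Michael.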

For a beautiful combination of logic and numbers, and the spectacular way in which it explodes, ask it a number-logic puzzle.

> Could you please solve the following number puzzle for each digit?

     AAA
  +  BBB
  --------
  = AAAC
Its response reveals an 'understanding' of what is being asked of it and of the constraints involved, and it uses a reasonable problem-solving approach, but the logical errors it commits turn the result into utter nonsense.

Nonsense like trying '15' for the value of A, because it thinks that A must be an odd multiple of '5', due to algebraic gems, such as:

  2A + 2B = A + 10C - 11
  Simplifying, we get:
  A = 5C - 6 - B/2
Its idea of math - even math it is using to solve problems - simply consists of building a soup of numbers and letters.
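
For comparison, the puzzle is tiny when searched exhaustively. A minimal sketch, assuming the usual cryptarithm conventions (A, B and C are distinct digits, and the leading digits A and B are nonzero); `solve_puzzle` is an illustrative name:

```python
def solve_puzzle():
    """Find digits (A, B, C) with AAA + BBB = AAAC."""
    solutions = []
    for a in range(1, 10):        # A leads AAA and AAAC, so assumed nonzero
        for b in range(1, 10):    # B leads BBB, so assumed nonzero
            for c in range(10):
                if len({a, b, c}) < 3:
                    continue      # assumed: different letters are different digits
                # AAA = 111*A, BBB = 111*B, AAAC = 1110*A + C
                if 111 * a + 111 * b == 1110 * a + c:
                    solutions.append((a, b, c))
    return solutions


for a, b, c in solve_puzzle():
    print(f"A={a}, B={b}, C={c}: {111 * a} + {111 * b} = {1110 * a + c}")
```

The same result falls out by hand: the equation reduces to 111B = 999A + C, and since 111B is at most 999, A has to be 1, which forces B = 9 and C = 0.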
Arithmetic fail kinda makes sense when you look at how the numbers get tokenized. Try this:

https://platform.openai.com/tokenizer

Then imagine how well you'd be able to do even basic math if your representation of numbers was such that 2045 is made up of tokens (20,45) while 2145 is (2,145) and 2005 is just (2005). No wonder that whatever relationships it derived from the training corpus don't generalize well.
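
To make the effect concrete, here is a toy sketch. It is not the real GPT tokenizer (which uses byte-pair encoding with a learned vocabulary); it is a greedy longest-match splitter over a small hypothetical vocabulary, chosen only so that it reproduces the three splits mentioned above.

```python
# Hypothetical vocabulary: a few multi-digit chunks plus every single digit.
TOY_VOCAB = {"2005", "145", "20", "45"} | set("0123456789")


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text greedily, always taking the longest vocabulary entry available."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens


for number in ("2045", "2145", "2005"):
    print(number, toy_tokenize(number))
```

Three numbers that differ by one digit each end up with three unrelated token sequences, so nothing about place value is visible at the token level.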

Ask it to work through the problem first and write down intermediate steps and only write the answer at the end. You should get better results than "wrong answer, then trying to justify it"
It does work through the problem, both with the logic, and with the number puzzle, providing all the intermediate steps necessary to solve it.

The problem here is that all the intermediate steps have serious mistakes in them. It's like asking a Markov chain to do algebra. There are numbers, and letters, and equals signs, and it's all just word soup.

https://pastebin.com/Yy35m6um

You can try the dance problem with this prompt prefix:

To answer the following problem, work through it by reasoning step by step and writing that reasoning down, making sure steps are not conflicting with previous steps. Only after you've written down all the steps, write down the final answer and base it on the previous steps.

I tried the dance problem, and regenerated the response three times.

The first two claimed that Michael will go to the dance, but the third one made the correct argument that Michael may or may not go to the dance.

It didn't help it do any better on the number problem. Prepending that paragraph still has it get the first step wrong, as well as everything that follows it.

> To solve this puzzle, we need to find the values of A, B, and C that satisfy the equation:

> AAA + BBB = AAAC

> Let's start by looking at the rightmost digit, which is C. We know that C must be either 0 or 1 because the sum of two digits cannot be greater than 18 (9 + 9 = 18). Also, C cannot be 0 because that would mean that A and B would be equal, which is not allowed in this puzzle. Therefore, C must be 1.

... And then it keeps going into la-la land.

The final answer it gives is, by the way:

957 + 483 = 1440

That's how it seems. That said, this seems like a very tractable problem to fix.
Try this prompt:

    For the following question, do all the calculations first and write them out. Only then answer the question based strictly on the result of the calculations. 
    
    What is heavier, 1000 tons of helium or a cubic meter of osmium?

Otherwise, you get a high-probability answer guided by the training data (which contains a lot of trick questions) and a bunch of attempts by the LLM to justify the wrong answer.