Hacker News
by cheeko1234 1010 days ago
It failed my basic benchmark that only GPT4 passes:

Q: I have a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?

Falcon180B: First, fill the 6-liter jug completely. Then, pour it into the 12-liter jug. Fill the 6-liter jug again, and pour it into the 12-liter jug until the 12-liter jug is full. At this point, there will be exactly 6 liters of water in the 6-liter jug, which you can then use.

Q: Using only a 6-liter jug, how can I get exactly 6 liters of water?

Falcon180B: You cannot get exactly 6 liters of water in a 6-liter jug because it would already be full. However, if you had another container that could hold at least 6 liters of water, you could follow these steps:

1. Fill the 6-liter jug with water.
2. Pour the water from the 6-liter jug into the other container.
3. Refill the 6-liter jug with water and pour it into the other container.
4. Repeat step 3 until you have poured 6 liters of water into the other container.

This way, you can get exactly 6 liters of water by combining the capacity of the 6-liter jug and the other container.
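
For reference, the expected answer to the first question can be checked mechanically. The following is a minimal sketch, not anything the models run: a breadth-first search over jug states, where the function name `solve_jugs` and the wording of the moves are made up for illustration. It assumes the usual rules of the puzzle (fill a jug, empty a jug, or pour one jug into another until the source is empty or the destination is full).

```python
from collections import deque


def solve_jugs(capacities, target):
    """Return the shortest list of moves that leaves `target` liters in any jug,
    or None if no sequence of moves can do it."""
    start = tuple(0 for _ in capacities)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, moves = queue.popleft()
        if target in state:
            return moves
        candidates = []
        for i, cap in enumerate(capacities):
            # Fill jug i to the brim.
            filled = list(state)
            filled[i] = cap
            candidates.append((tuple(filled), f"fill the {cap}-liter jug"))
            # Empty jug i completely.
            emptied = list(state)
            emptied[i] = 0
            candidates.append((tuple(emptied), f"empty the {cap}-liter jug"))
            # Pour jug i into every other jug j.
            for j, cap_j in enumerate(capacities):
                if i == j:
                    continue
                amount = min(state[i], cap_j - state[j])
                poured = list(state)
                poured[i] -= amount
                poured[j] += amount
                candidates.append(
                    (tuple(poured),
                     f"pour the {cap}-liter jug into the {cap_j}-liter jug")
                )
        for next_state, description in candidates:
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, moves + [description]))
    return None


print(solve_jugs((12, 6), 6))
```

With a 12-liter and a 6-liter jug the search stops after a single move, which is the trivial answer the question is fishing for. The same function also handles the classic versions of the puzzle, for example `solve_jugs((5, 3), 4)`.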

9 comments

Probably every single thread comparing GPT (3.5 or especially 4) needs a copy-paste caveat: it's not really fair to compare a "bare" model like Falcon (not even its instruct version) with the ChatGPT _service_. The service part is crucial, because it includes advanced features such as "secret sauce" output sampling (while on Hugging Face all models by default use basic greedy sampling). The bare model also gets no preprompt, whereas the service may add one depending on subject detection and so on. For a fair comparison we would need the exact same prompt and the exact same output sampling. And that's not going to happen with (not at all) OpenAI.
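
To make the sampling point concrete, here is a minimal sketch of the difference between greedy decoding and temperature sampling for a single next-token choice. The logits are made-up numbers and the function names (`softmax`, `greedy_pick`, `sample_pick`) are illustrative, not any library's API; whatever sampling ChatGPT actually uses is not public.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature=1.0, rng=random):
    """Temperature sampling: draw a token in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [1.0, 3.0, 2.0]  # hypothetical scores for three candidate tokens
rng = random.Random(0)
print("greedy:", greedy_pick(toy_logits))
print("sampled:", [sample_pick(toy_logits, temperature=1.0, rng=rng) for _ in range(10)])
```

Greedy decoding returns the same token every time, while sampling can return any of the three. Two runs of the "same" model can therefore differ purely because of the decoding settings.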
> "secret sauce" output sampling

are you referring to beam search? something else?

Beam search is well known. I mean strategies like beam search, but ones we don't know about.

I can imagine some, for example like beam search but you score every option with a smaller model. Of course one can say "but we see every token as it streams", to which I might say: are you sure? Perhaps they generate a hundred entire responses in the time it takes for one token to be shown. They just "stream" those tokens so slowly to make it feel more "human pace" oriented.
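
The simplest version of that idea is usually called best-of-n reranking: generate several complete responses, score each with a second model, and show only the winner. This is a minimal sketch of that, which is a simpler relative of beam search rather than beam search itself. `toy_generate` and `toy_score` are hypothetical stand-ins for a large generator and a small scoring model, and nothing here is known to be what OpenAI does.

```python
import itertools
from typing import Callable


def best_of_n(
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    prompt: str,
    n: int = 100,
) -> str:
    """Draw n full candidate responses and return the one the scorer rates highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda text: score(prompt, text))


# Toy stand-ins (hypothetical): the "generator" cycles through canned answers
# and the "scorer" simply prefers shorter answers.
_canned = itertools.cycle([
    "Fill the 12-liter jug, pour into the 6-liter jug, empty it, pour again.",
    "Fill the 6-liter jug.",
    "Fill the 12-liter jug and pour half of it out.",
])


def toy_generate(prompt: str) -> str:
    return next(_canned)


def toy_score(prompt: str, response: str) -> float:
    return -float(len(response))


print(best_of_n(toy_generate, toy_score, "I need 6 liters of water.", n=3))
```

The user would only ever see the selected response streamed out, which is why streaming alone does not rule this kind of scheme out.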

interesting. but there should be physical limits to that, which we can handicap to put bounds on speculation. so for example, FLOP/s has an upper bound and you can make latency estimates for 1/10/100B models. this would put reasonable bounds on statements like "a hundred entire responses in the time it takes for one token to be shown"
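
A rough version of that estimate, as a sketch under loudly stated assumptions: the rule of thumb that a forward pass costs about 2 FLOPs per parameter per generated token, a hypothetical peak of 1e15 FLOP/s for the serving hardware, and 200 tokens per response. None of these numbers come from OpenAI, and the bound ignores memory bandwidth, batching, and attention cost, so real latency would be higher.

```python
def min_seconds(params: float, tokens: float, peak_flop_per_s: float) -> float:
    """Lower bound on generation time, assuming ~2 FLOPs per parameter per token
    and perfect utilization of the hardware's peak throughput."""
    total_flop = 2.0 * params * tokens
    return total_flop / peak_flop_per_s


ASSUMED_PEAK = 1e15          # hypothetical hardware peak, FLOP/s
RESPONSES = 100              # "a hundred entire responses"
TOKENS_PER_RESPONSE = 200    # assumed response length

for params in (1e9, 1e10, 1e11):  # 1B, 10B, 100B parameters
    seconds = min_seconds(params, RESPONSES * TOKENS_PER_RESPONSE, ASSUMED_PEAK)
    print(f"{params:.0e} params: at least {seconds:.2f} s for {RESPONSES} responses")
```

Under these assumptions the bound grows linearly with model size, so whether "a hundred responses per displayed token" is plausible depends heavily on how large the model is and how much hardware sits behind each request.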
I just asked GPT-4 your first question and got this in response. It's still a correct result, but with unnecessary steps:

> Fill the 12-liter jug and pour water into the 6-liter jug until it's full. Empty the 6-liter jug and pour the remaining water from the 12-liter jug into it. You'll have 6 liters in the 6-liter jug.

My exchange:

Me:

  I have a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?
ChatGPT:

  Using the 12-liter jug and the 6-liter jug, you can get exactly 6 liters of water in the following way:

  1. Fill the 12-liter jug to its maximum capacity.

  2. Pour the water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. This will leave you with 6 liters of water in the 12-liter jug.

  Now you have 6 liters of water in the 12-liter jug.
Me:

  Can you think of a simpler way to achieve this?
ChatGPT:

  Yes, a simpler way to achieve this is:

  1. Fill the 6-liter jug to its maximum capacity.

  That's it. You now have 6 liters of water in the 6-liter jug!
I wish I understood why prompting it a second time gets a better answer.
I bet when you said a 12 litre jug and a 6 litre jug it wrongly assumed that you required it to actually make use of both the jugs in some way (not merely that they were available for possible use), leading to the pointless step.
Seems right! If you make it more of an inventory list of tools, it answers correctly.

> I have two jugs: a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?

> GPT-4: If you just need 6 liters of water and you have a 6-liter jug, you simply fill the 6-liter jug to the top with water. You'll have exactly 6 liters! No need to use the 12-liter jug in this case.

This video covers the concept pretty well: https://www.youtube.com/watch?v=IJEaMtNN_dM

It is pretty normal to try to incorporate the extraneous details into the reply.

I would bet a high percentage of humans would do the same thing if prompted as such.
I've noticed that the LLMs are all tuned to emit corporate speak.

Everyone I've encountered that adds lots of obfuscating and tangential details to their day-to-day speech (and tries to establish that particular tone of faux-inclusivity and faux-authority) has turned out to be a sociopath and/or compulsive liar. I find it interesting that LLMs have the same symptom and underlying problem.

Isn't the right answer just fill the 6-liter jug? I don't get it. Is it supposed to be a trick question?
What about the ketchup test? Ask it to tell you how many times the letter e appears in the word ketchup. Llama always tells me it's two.
Spelling challenges are always going to be inherently difficult for a token-based LM. It doesn't actually "see" letters. It's not a good test for performance (unless this is actually the kind of question you're going to ask it regularly).
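
A toy illustration of why letters are hidden from the model. This is a sketch with a made-up vocabulary and a simple greedy longest-match tokenizer; real tokenizers (BPE and friends) work differently and split words differently, so the pieces and ids below are purely hypothetical.

```python
import string

# Hypothetical vocabulary: single letters plus two multi-letter pieces.
TOY_VOCAB = {letter: index for index, letter in enumerate(string.ascii_lowercase)}
TOY_VOCAB.update({"ket": 101, "chup": 102})


def tokenize(word: str, vocab: dict) -> list:
    """Greedy longest-match tokenization: repeatedly take the longest known piece."""
    ids = []
    position = 0
    while position < len(word):
        for end in range(len(word), position, -1):
            piece = word[position:end]
            if piece in vocab:
                ids.append(vocab[piece])
                position = end
                break
        else:
            raise ValueError(f"cannot tokenize {word[position:]!r}")
    return ids


print(tokenize("ketchup", TOY_VOCAB))
```

With this vocabulary the model would receive two ids for "ketchup" and nothing that directly says there is an "e" inside the first one. It has to have learned the spelling of each piece some other way.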
I've found it's more reliable to ask it to write some javascript that returns how many letters are in a word. Works even with Llama 7b with some nudging.
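
The kind of snippet meant here is tiny. The comment mentions JavaScript; this sketch shows the same idea in Python, with a hypothetical function name `count_letter`.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("ketchup", "e"))
```

Because the counting is done by the interpreter on actual characters, the result does not depend on how the model tokenizes the word.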
Falcon fails. GPT-3.5 also fails this test. GPT-4 gets it right. I suspect that GPT-4 is just large enough to have developed a concept of counting, whereas the others are not. Alternatively, it's possible that GPT-4 has memorized the answer from its more extensive training set.
It's not possible for an LLM to count letters directly; it only "sees" tokens.
Bard can also give the correct result.
Is this supposed to be a trick question? How can it be a good question for testing an AI if a human cannot understand it either?

I think if you ask this question on different websites (to humans) you will get many different and confused answers. So why bother asking an AI? I don't even know what the right answer is.

I don’t think this is a particularly useful benchmark.

It’s well known that LLMs are bad at math. The token based weighting can’t properly account for numbers that can vary wildly. Numbers are effectively wildcards in the LLM world.

Surely this is a "didn't read the question properly" problem rather than a "didn't maths right" problem?

And that (understanding a natural language question) is the USP for LLMs.

I don't buy it. In any common usage "6 liter jug" means a jug capable of holding 6 liters, not with a volume of 6 liters including the walls.
I don't understand your comment. Falcon said that it's impossible to measure 6 liters of water with a 6 liter jug.
Surely the reason LLMs fail here is because this is an adaptation of a common word problem, except your version has been tweaked so that there is a trivial answer.
Yes, that's the point of the question. We want to know if it's actually doing some reasoning, or if it has just memorized an answer.
It's the latter. For every LLM out there. They are trained to memorize, not reason. It will take radically different training techniques to make these networks reason in a human-like way.
Memorising is so trivial we've been doing it by default since forever, regardless of if that means magnetic core memory, the Jacquard Loom, the Gutenberg press, the ceramic movable type China had for a few centuries before Gutenberg, or using a stick to smudge words into soft clay tablets that were accidentally made permanent by a house fire.

AI like this aren't just memorisation.

They almost certainly don't think like us — even if they did at a low level, the training regime would take the equivalent of hundreds of human lifetimes, and the number of parameters in the larger models is a thousandth of the number in a human brain.

Then how do you explain zero-shot performance?
This does not look like a good benchmark test for an LLM capability.
I, a human, have no idea how to answer this weird question, why do you suppose an AI would do better?

I can’t work out if it’s a joke question or a serious question?