| HN Mirror


by Uehreka 624 days ago

GP is referring to the fact that if it becomes well known that LLM version X can’t solve problem Q, then the model’s trainers will make sure to include problem Q prominently in the training set, running it through over and over to ensure that version X+1 is able to solve the problem whether the model’s “reasoning” abilities have improved or not.

Thus observers of the LLM space like us need to keep finding novel “bellwether problems” that we think will evaluate a model’s ability to reason, knowing that once we start talking about them openly they will no longer be useful bellwethers.

By their nature as “weird-shaped” problems, these aren’t the kind of thing we’re guaranteed to have an infinite supply of. As the generations move on it will become more and more difficult to discern “actual improvements in reasoning” from “the model essentially has the solution to your particular riddle hard-coded”.


godelski 624 days ago

Oh, thanks for the correction. I did misinterpret.

Though I will say that LLMs don't appear to be doing any better at the river crossing puzzles. They tend to "patch" the ones I and others actively tweet about, but they still aren't becoming better at generalizing. I've taken this as fairly strong evidence that we're going in the wrong direction for reasoning (as opposed to a similar direction). But the strongest evidence to me is that they're entropy minimizers.

What's extra interesting is that transformers CRAVE augmentations. I work in vision, and augmentation is necessary to get them to do well. You can actually get much smaller models to do what bigger models can if you get this right.


joe_the_user 623 days ago

  > Oh, thanks for the correction. I did misinterpret.

Illustrates that language is hard for humans too, hah.

Anyway, the "next iteration solves it" effect is definitely a result of common problems leaking. But it could also be a result of LLMs being universal but not efficiently universal problem solvers, with people tending to choose the simplest problem that can't be solved (such theories seem illustrative).

Also, your river-crossing problems seem quite useful.


godelski 622 days ago

  > hah

And? That's not the issue with LLMs.

The issue is an inability to reason. Sure, a human might also have difficulties with river crossing problems, even trivial ones, but I can't get a person to tell me that all animals can fit in the boat, to then put all the animals into the boat, and then proceed to make multiple trips across the river. If they get the first two they always get the right answer. But this is not true for an LLM. That's a very clear demonstration of a lack of reasoning and a lack of having a world model.

It's not about coaching or finding the right prompt, it's that the logic is inconsistent and unreasonable (yes, humans will fail at logic, but *reasoning doesn't mean correct answer*). It fails to meet the basic definition of reasoning.

The whole fucking goal is generalization. That's the G in AGI and the most important thing in all 3 of those letters. We don't have strong evidence of generalization. For GI we want out-of-distribution generalization, but we're not doing so well at in-distribution generalization. That's demonstrated by the river crossing puzzles, Cheryl's birthday, and the recently famous 9.8 vs 9.11 (https://x.com/sainingxie/status/1834300251324256439).
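
For anyone unfamiliar with the 9.8 vs 9.11 example, here is a minimal sketch of why the comparison is easy to get backwards. The same two strings order differently depending on whether you read them as decimal numbers or as dotted version numbers. The function names are made up for illustration, and the sketch says nothing about which reading a given model is actually applying.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Return whichever string is larger when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Return whichever string is larger when both are read as dotted versions."""
    def key(s: str) -> tuple:
        # "9.11" -> (9, 11), compared component by component
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


print(larger_as_decimal("9.8", "9.11"))  # 9.8
print(larger_as_version("9.8", "9.11"))  # 9.11
```

As numbers, 9.8 is larger because 0.80 > 0.11. As versions, 9.11 is larger because 11 > 8.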

Yes, the next iteration will get better. But better in which direction? Being dismissive of what it fails at just means you don't get better in that direction unless you get lucky.


fragmede 624 days ago

Oh interesting, what variants were you using that it seems they patched?


godelski 624 days ago

Well, it depends on what you mean by "patched", but I'll be generous and say that it gets the right answer and there are no logical errors. I used to use

  Which weighs more, a kilogram of bricks or a pound of feathers

and that worked until the second or third update on GPT 3.5 (when everyone was like "but did you try 3.5?")
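
For reference, the arithmetic behind that question is a single unit conversion. A quick sketch, using the standard definition of the international avoirdupois pound (the variable names are just for illustration):

```python
# One international avoirdupois pound is exactly 0.45359237 kg by definition.
KG_PER_POUND = 0.45359237

bricks_kg = 1.0                   # a kilogram of bricks
feathers_kg = 1.0 * KG_PER_POUND  # a pound of feathers, converted to kg

heavier = "bricks" if bricks_kg > feathers_kg else "feathers"
print(heavier)  # bricks
```

The material is a distraction. The kilogram is a bit more than twice the pound, so the bricks weigh more.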

Here's a river crossing puzzle with the current GPT-4o that "works"

  ----------------------------------------------------
  > I have a duck, a goose, and a pig that I need to take across a river. If I live the duck and goose alone, the duck will quack. If I leave the pig with the goose the pig will oink. I can only fit two animals in the boat at a time. What is the quickest way to get all animals across the river?

  To solve this puzzle, we need to find a way to transport the animals across the river without triggering the quacking or oinking conditions:

  1. **First Trip:** Take the duck and the goose across the river. Leave the goose on the other side and take the duck back with you.
  2. **Second Trip:** Take the pig across the river. Leave the pig on the other side and bring the goose back with you.
  3. **Third Trip:** Take the duck and the goose across the river again.

  Now all three animals are across the river, and at no point were the duck and goose left alone, nor were the pig and goose left together. This method ensures no quacking or oinking occurs. 
  ----------------------------------------------------

Idk why it decided to add the extra constraint of keeping the animals quiet but hey, it doesn't really matter.

As far as I know, I'm the first one that came up with this variant, and strangely it was consistently answered correctly when 4o came out (after my post got repeated by a few bigger names), so I thought it was patched. But just now I wasn't able to reproduce a success with various checkpoints and system prompts, so I guess it got undone?

  ----------------------------------------------------
  > I have a fox, a goose, and a bag of corn that I need to transport across a river. All three can fit in the boat. If you leave the fox alone with the goose, it'll eat the goose. If you leave the goose alone with the corn, it'll eat the corn. Describe how to get all items safely across the river.

  To solve this classic puzzle, follow these steps:

  1. **Take the Goose Across First:** Start by taking the goose across the river and leave it on the other side.

  2. **Return and Take the Fox Across:** Go back to the original side and take the fox across the river.

  3. **Bring the Goose Back:** When you reach the other side, leave the fox there and take the goose back with you to the original side.

  4. **Take the Corn Across:** Leave the goose on the original side and take the bag of corn across the river.

  5. **Return Alone:** Leave the bag of corn with the fox on the other side and return alone to the original side.

  6. **Take the Goose Across Again:** Finally, take the goose across the river one last time.

  Now, all three items—the fox, the goose, and the corn—are safely across the river without any being eaten.
  ----------------------------------------------------

claude-3-5-sonnet-20240620 and claude-3-opus-20240229 both act like it is the normal puzzle.

It doesn't matter if you say "All three can fit in the boat with the farmer." Nor if you ask it to explain its logic step-by-step.

The only humans I've seen struggle with this variant ask me to repeat the question several times before saying something similar to "I don't understand the trick. You just take all the animals across at once?" But the trick is there's no trick and I accept that as an answer lol.
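
To make "there's no trick" concrete, here is a rough brute-force sketch of the fox/goose/corn puzzle with the boat capacity as a parameter. The state encoding and the names `solve`, `is_safe`, and `FORBIDDEN` are my own assumptions for illustration, not anything from the transcripts above. It runs a breadth-first search over bank states, so the first solution it finds uses the fewest crossings.

```python
from collections import deque
from itertools import combinations

ITEMS = frozenset({"fox", "goose", "corn"})
# Pairs that must not be left together without the farmer (from the puzzle text).
FORBIDDEN = [{"fox", "goose"}, {"goose", "corn"}]


def is_safe(bank):
    """A bank without the farmer is safe if it holds no forbidden pair."""
    return not any(pair <= bank for pair in FORBIDDEN)


def solve(capacity):
    """Breadth-first search; returns the shortest list of boat loads, one per crossing."""
    # State: (items on the starting bank, True if the farmer is on the starting bank).
    start = (ITEMS, True)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer_left), path = queue.popleft()
        if not left and not farmer_left:
            return path
        here = left if farmer_left else ITEMS - left
        # The farmer may carry anywhere from 0 to `capacity` items.
        for k in range(min(capacity, len(here)) + 1):
            for load in combinations(sorted(here), k):
                load = frozenset(load)
                new_left = left - load if farmer_left else left | load
                # The bank the farmer just left is the unattended one.
                unattended = new_left if farmer_left else ITEMS - new_left
                state = (new_left, not farmer_left)
                if is_safe(unattended) and state not in seen:
                    seen.add(state)
                    queue.append((state, path + [sorted(load)]))
    return None


print(len(solve(3)))  # 1 crossing: everything goes over at once
print(len(solve(1)))  # 7 crossings: the classic puzzle
```

With capacity 3 the search returns a single crossing carrying all three items, which is the "you just take all the animals across at once" answer. With capacity 1 it finds the familiar seven-crossing solution that the models keep reciting.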

FWIW, I've had several LLMs explain to me that all animals can fit in the boat and then proceed to do the normal solution to the puzzle. I've also seen it put all animals into the boat and then only drop one off and travel back and forth, or sometimes the animals teleport, or even the farmer does. Sometimes the farmer is just going back and forth. That happens a lot when I use a variant where animals will follow the farmer and can swim.

Very often I see the better models do great at getting the right answers but fail (or do even worse) when explaining those answers. I don't think this makes the models useless, but I feel that their highest utility is mostly due to Google's ever decreasing quality. But what can you do?


fragmede 624 days ago

I like the noise variant! The "everything can fit in one trip" variant is the one I've been using, and I was able to get 4o to get it right in one shot with enough coaching, and o1-preview without coaching, which convinced me (of what, I'm not sure). My other riddle is the car accident doctor son one, which 4o couldn't get but o1-preview does.

I'll have to come up with more obscure riddles and not talk about them online and only use temporary chats which aren't used as training data and see what happens next. I'm sure I have a puzzle book in my library that I can use to help me make new ones.


godelski 624 days ago

Be careful with coaching. It's very easy to leak information. The point is to get it without letting it know.

As for o1, well, I've been using this for a year and a few big players have used it too. So remember that these problems get spoiled because they end up in the training set.


fragmede 623 days ago

Good point! Problem is, I can't know what other people have spoiled it on either, so if we'd independently come up with the now spoiled "the boat can take all" variant, I can't know unless that gets revealed over Twitter or arXiv or HN or wherever.
