by stubybubs 1129 days ago
It's telling that it didn't pick up on the fact that the whole "once you look you can't touch the switches anymore" rule isn't in this version of the riddle. I mean, the obvious strategy in this case is to turn on the first switch, go upstairs, and look at the bulbs. Go back downstairs and try the second switch. You've now got them all mapped out in about 30 seconds.

Using GPT4, I asked if there was a way to do it in less than 3 years, but it couldn't figure this out even when I told it you can look and use the switches as much as you want. Instead it suggested turning on a switch for 10 minutes, then using your "excellent alien vision" to determine which 3 year lifespan bulb has 10 minutes of wear on it.

Makes me think GPT4 doesn't really have better reasoning, it just looks like better reasoning because it's been fed way more data.
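
For what it's worth, the obvious strategy is trivial to write down. Here's a rough sketch in Python, assuming each switch controls exactly one bulb and you can look as often as you like. The `Room` class and `map_switches` function are made up for illustration and aren't from the riddle itself.

```python
class Room:
    """Hypothetical stand-in for the house: hidden wiring from switches to bulbs."""

    def __init__(self, wiring):
        self._wiring = dict(wiring)  # switch -> bulb, unknown to the solver
        self._on = set()             # switches currently flipped on

    def flip(self, switch):
        # Toggle a single switch on or off.
        self._on ^= {switch}

    def look(self):
        # Walk upstairs and report which bulbs are lit right now.
        return {self._wiring[s] for s in self._on}


def map_switches(room, switches):
    """Flip one switch at a time and look; no cleverness needed."""
    mapping = {}
    for switch in switches:
        room.flip(switch)        # turn this switch on
        (bulb,) = room.look()    # assumption: exactly one bulb is lit
        mapping[switch] = bulb
        room.flip(switch)        # turn it back off before trying the next
    return mapping


if __name__ == "__main__":
    room = Room({"A": 2, "B": 0, "C": 1})
    print(map_switches(room, ["A", "B", "C"]))
```

One flip and one look per switch, which is the whole point: nothing in this version of the riddle stops you from looking more than once.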


Variants of common riddles remain the final frontier. I'm trying to cross the river in a canoe with a carrot, cabbage, and cucumber...
Variants of common riddles actually can be solved with GPT-4, but you have to rewrite them so they don't look like the riddle from memory (sometimes it's as easy as changing the names to something completely different). Turns out language models trust their memory quite a bit. Slightly related: they won't actually use the results of tools if those results differ a lot from what they expect the output to be - https://vgel.me/posts/tools-not-needed/
"Language models trust their memory quite a bit."

All they have is memory, either in the weights or the input prompt. To the extent that these models appear to reason, it is precisely in the ability to successfully substitute information from the prompt into reasoning patterns in the training data. It shouldn't be any surprise that this fails when patterns in the prompt strongly condition the model to reproduce particular patterns of reasoning (e.g., many words in the riddle indicate a well-known riddle, but the details are different).

I know the impulse to anthropomorphize is almost impossibly seductive, but I find that the best way to understand and use these models is to remember: they are giant conditional probability distributions for the next token.

LLMs trained on code reason better. They perform better on reasoning benchmarks even if the benchmarks have nothing to do with code. You're wrong.

https://arxiv.org/abs/2210.07128

Code is often just a sequence of steps (sometimes with comments to indicate goals). As such, it is just another form of patterns of reasoning. Many chains of thought that you would utilize in code are useful skeletons to think about other things.

I don't see how this undermines my point.

If code transfers to reasoning tasks that don't have anything to do with code, then what is being "substituted"? Ideas and concepts?

Code and MMLU don't share similar "reasoning patterns" unless you're being extremely vague, in the "they both require reasoning" sense.

This really just makes it seem like it's not reason at all. The trick (or rather un-trick) here is that you can look at the bulbs as many times as you want. Even if I explicitly tell GPT4 that, it doesn't get it.

It's not reason, it's mapping.