HN Mirror



	by andrew_eu 684 days ago
	I thought with this chain-of-thought approach the model might be better suited to solve a logic puzzle, e.g. ZebraPuzzles [0]. It produced a ton of "reasoning" tokens but hallucinated more than half of the solution with names/fields that weren't available. Not a systematic evaluation, but it seems like a degradation from 4o-mini. Perhaps it does better with code reasoning problems though -- these logic puzzles are essentially contrived to require deductive reasoning.

	[0] https://zebrapuzzles.com

2 comments

slig 684 days ago

Hey, I run ZebraPuzzles.com, thanks for mentioning it! Right now I'm trying to improve the puzzles so that people can't "cheat" using LLMs so easily ;-).


andrew_eu 684 days ago

It's fantastic! Thanks for the great work.


slig 684 days ago

Thank you so much!


energy123 684 days ago

o1-mini does better than any other model on zebra puzzles. Maybe you got unlucky on one question?

https://www.reddit.com/r/LocalLLaMA/comments/1ffjb4q/prelimi...


andrew_eu 684 days ago

Entirely possible. I did not try to test systematically or quantitatively, but it's been a recurring easy "demo" case I've used with releases since 3.5-turbo.

The super verbose chain-of-reasoning that o1 does seems very well suited to logic puzzles as well, so I expected it to do reasonably well. As with many other LLM topics, though, the framing of the evaluation (or the templating of the prompt) can impact the results enormously.
