| HN Mirror


by zone411 19 days ago

I’ve tested this model on four of my benchmarks:

https://github.com/lechmazur/buyout_game 10th out of 36.

https://github.com/lechmazur/pact/ 14th out of 25.

https://github.com/lechmazur/nyt-connections/ 60th out of 81.

https://github.com/lechmazur/debate 16th out of 29.

3 comments

baxtr 18 days ago

Good stuff!

Is there a reason you changed the leaderboard graphs for the third and fourth ones?

Also: it would be great to have an overview page with a summary across all tests, like a total score or similar.


Sandworm5639 18 days ago

Oh, I love the connections benchmark.

Just curious, can you share which puzzles are the hardest ones that even the top models can't crack? Sometimes when I find a puzzle absolutely undecipherable, I like to ask LLMs to solve it, and I haven't seen them fail yet.


aorloff 18 days ago

Ask your top model this question: I'm 100 feet away from the carwash, should I drive my car or walk?


Tepix 17 days ago

You messed up the question.


CamperBob2 18 days ago

Would be interesting to see the 27B dense Qwen 3.6 model thrown into the mix.
