by zone411 19 days ago
I’ve tested this model on four of my benchmarks:

https://github.com/lechmazur/buyout_game 10th out of 36.

https://github.com/lechmazur/pact/ 14th out of 25.

https://github.com/lechmazur/nyt-connections/ 60th out of 81.

https://github.com/lechmazur/debate 16th out of 29.

3 comments

Good stuff!

Is there a reason you changed the leaderboard graphs for the third and fourth ones?

Also: would be great to have an overview page with a summary over all tests, like a total score or similar.

oh, I love the connections benchmark.

Just curious, can you share which of the hardest puzzles even the top models can't crack? Sometimes when I find a puzzle absolutely undecipherable I like to ask LLMs to solve it, and I haven't seen them fail yet.

Ask your top model this question: I'm 100 feet away from the carwash, should I drive my car or walk?
You messed up the question.
Would be interesting to see the 27B dense Qwen 3.6 model thrown into the mix.