Hacker News
Evaluating LLM Reasoning Through Live Computer Games (lmgame.org)
24 points by snyhlxde 492 days ago
10 comments

A dynamic human-in-the-loop evaluation benchmark is great for preventing data contamination and test saturation. Worth my time to read.
Just played the game and it seems pretty fun - especially when you see the LLM speaking nonsense haha
It's funny that reasoning models sometimes speak nonsense and perform worse than well-aligned models like claude-3.5-sonnet in multi-turn games like Akinator. I think it's a current weak point of applying long-CoT RL vs. instruction-following alignment. Maybe we need to find a way to address both? Would be interesting to see some results.
Static evaluation --> chatbot arena --> game arena. Seems to be promising.
I played the game and found hard mode to be an exciting challenge—it's incredibly fun, and the AI is so clever it even guessed my intentions in the taboo game!
This is pretty clever and seems to have high potential, but it still relies on humans. What if someday no human can outsmart AI?
When superintelligence comes, it would be very interesting to see multi-party gameplay among AIs too. What role humans would play in this story is unclear. Maybe humans couldn't directly engage in the games either, as they would be too naive and would be immediately identified and exploited by the AI :)
So transparent.

https://news.ycombinator.com/item?id=43017857

zhisbug 1 day ago

We hope to redefine AI evaluation via our gamified AI evaluation platform: Game Arena!

That is an interesting perspective that I hadn't previously considered: using games to evaluate LLMs.
Is it possible to set up an MLLM pipeline to play other Roblox games and use that as another evaluation?
I think it's totally possible. Multimodal reasoning eval would be fun to consider too.
Challenge yourself with the latest reasoning LLMs and check out our latest leaderboard!
Cool, actually not boring and hard to play, good gamification
great project