Evaluating LLM Reasoning Through Live Computer Games

	Evaluating LLM Reasoning Through Live Computer Games (lmgame.org)
	24 points by snyhlxde 492 days ago

10 comments

meru_2025 492 days ago

A dynamic human-in-the-loop evaluation benchmark is great for preventing data contamination and test saturation. Worth my time to read.


ginda307 492 days ago

Just played the game and it seems pretty fun - especially when you see the LLM speaking nonsense haha


snyhlxde 492 days ago

It's funny that reasoning models sometimes speak nonsense and perform worse than well-aligned models like claude-3.5-sonnet in multi-turn games like Akinator. I think it's one current weak point of applying long-CoT RL vs. instruction-following alignment. Maybe we need to find a way to address both? It would be interesting to see some results.


PY007 492 days ago

Static evaluation --> chatbot arena --> game arena. Seems to be promising.


Yuxuan_Zhang13 492 days ago

I played the game and found hard mode to be an exciting challenge—it's incredibly fun, and the AI is so clever it even guessed my intentions in the taboo game!


zhisbug 492 days ago

This is pretty clever and seems to have high potential, but it still relies on humans. What if someday no human can outsmart AI?


snyhlxde 492 days ago

When superintelligence comes, it would be very interesting to see multi-party gameplay among AIs too. What role humans play in this story is unclear. Maybe humans can't directly engage in the games either, as they are too naive and will be immediately identified and exploited by AI :)


elpocko 492 days ago

So transparent.

https://news.ycombinator.com/item?id=43017857

zhisbug 1 day ago | next [–]

We hope to redefine AI evaluation via our gamified AI evaluation platform: game arena!


mino1234uiui27 492 days ago

That is an interesting perspective that I hadn't previously considered: using games to evaluate LLMs.


wlsaidhi 492 days ago

Is it possible to set up an MLLM pipeline to play other Roblox games and use that as another evaluation?


snyhlxde 492 days ago

I think it's totally possible. Multimodal reasoning eval would be fun to consider too


snyhlxde 492 days ago

Challenge yourself with the latest reasoning LLMs and check out our latest leaderboard!


leemack 492 days ago

Cool, actually not boring and hard to play, good gamification


flaciplam 492 days ago

great project
