| HN Mirror



	by meffmadd 58 days ago
	I found LLMs to be surprisingly good at puzzle games like Baba Is You: https://meffmadd.github.io/samplesurium/posts/baba_is_agent/

3 comments

duckmysick 58 days ago

Does this use levels from the original game or some custom ones? The solutions to the original levels should be in the training data, be it blogs, reddit comments, or wikis.

Unless the goal was to test how well large language models translate prose solutions into actionable keyboard inputs, which is pretty interesting in itself.


vunderba 58 days ago

Agreed. I’m surprised how often people seem to miss this. They don’t realize just how gargantuan the training datasets are for these large language models, especially for a very popular game like Baba Is You. I’m sure that both GameFAQs and the Steam forums are in the training data for any reasonably SOTA LLM, and both almost assuredly have complete walkthroughs for BIY.


stared 58 days ago

Nice!

I remember "Baba is Eval" (https://fi-le.net/baba/), released 11 months ago, back when Claude Opus 4 was the strongest model. Back then, I was surprised how poor it was even at the first level.

I am happy to see another approach, and indeed one with much stronger results.


meffmadd 58 days ago

Yes, that was the post that inspired me to build this.

While I did implement a more comprehensive harness with pathfinding tools etc., the models themselves have improved significantly.


stared 57 days ago

I saw you mentioned that!

Anyway: did you test it with Claude Opus 4.8?


IsTom 58 days ago

I think it should be kept in mind that these are not the hard levels BIY is famous for.


meffmadd 58 days ago

That is very true, but I was surprised by how clear the “signal” was. Only Gemini really confidently solved all levels. But yeah, the goal is now to include harder levels as well!
