Hacker News
by meffmadd 10 days ago
I found LLMs to be surprisingly good at puzzle games like Baba Is You: https://meffmadd.github.io/samplesurium/posts/baba_is_agent/
3 comments

Does this use levels from the original game or some custom ones? The solutions to the original levels should be in the training data, be it blogs, Reddit comments, or wikis.

Unless the goal was to test how well large language models translate solutions written in prose into actionable keyboard inputs, which is pretty interesting in itself.

Agreed. I’m surprised how often people seem to miss this. They don’t realize just how gargantuan the training datasets are for these large language models, especially for a very popular game like Baba Is You. I’m sure that both GameFAQs and the Steam forums are in the training data for any reasonably SOTA LLM, and both almost assuredly have complete walkthroughs for BIY.
Nice!

I remember "Baba is Eval" (https://fi-le.net/baba/), released 11 months ago, back when Claude Opus 4 was the strongest model. Back then, I was surprised by how poorly it did even on the first level.

I am happy to see another approach, and indeed one with much stronger results.

Yes, that was the post that inspired me to build this.

While I did implement a more comprehensive harness with pathfinding tools etc., the models themselves have improved significantly.

I saw you mentioned that!

Anyway: did you test it with Claude Opus 4.8?

I think what should be kept in mind is that these are not the hard levels BIY is famous for.
That is very true, but I was surprised by how clear the “signal” was. Only Gemini really confidently solved all levels. But yeah, the goal is now to include harder levels as well!