Does this use levels from the original game or custom ones? The solutions to the original levels should be in the training data, be it blogs, Reddit comments, or wikis.
Unless the goal was to test how well large language models translate prose solutions into actionable keyboard inputs, which is pretty interesting in itself.
Agreed. I’m surprised how often people seem to miss this. They don’t realize just how gargantuan the training datasets for these large language models are, especially when it comes to a very popular game like Baba Is You. I’m sure that both GameFAQs and the Steam forums are in the training data for any reasonably SOTA LLM, and both almost assuredly have complete walkthroughs for BIY.
I remember "Baba is Eval" (https://fi-le.net/baba/), released 11 months ago, back when Claude Opus 4 was the strongest model. Back then, I was surprised by how poorly it did, even on the first level.
I am happy to see another approach - and indeed, with much stronger results.
That is very true, but I was surprised by how clear the “signal” was. Only Gemini really solved all levels confidently. But yeah, the goal now is to include harder levels as well!