Hacker News
by kqr 10 days ago
They aren't good at Zork[1], nor at newer and/or more obscure text adventures[2].

[1]: https://www.lowimpactfruit.com/p/zork-bench-an-llm-reasoning...

[2]: https://entropicthoughts.com/evaluating-llms-playing-text-ad...