HN Mirror

	by mnky9800n 60 days ago
	This is why I made Zork bench. Zork, the text adventure game, is in the training data for LLMs. It’s also deterministic. Therefore it should be easy for an LLM to play and complete. Yet they don’t. Understanding why is the goal of Zork bench. https://github.com/mnky9800n/zork-bench

3 comments

kqr 60 days ago

I have worked on similar problems. See e.g. [1].

The LLMs I have tested have terrible world models and poor intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary.

[1]: https://entropicthoughts.com/updated-llm-benchmark

(more descriptions available in earlier evaluations referenced from there)

malfist 60 days ago

I'm going to ignore all that and tell my developers working in complicated codebases that they have to use AI. I'm sure comprehending side effects in a world-building text adventure is completely different than understanding spaghetti code.

red75prime 60 days ago

Desarcasmed version: "I think that problems with Zork make those models virtually useless in programming tasks." Correct?

cptskippy 60 days ago

He said complicated code bases. LLMs are great at producing small snippets of code to address very targeted problems.

red75prime 60 days ago

Great on small snippets of code, passable on larger pieces of code, great at finding vulnerabilities in large pieces of code, terrible at Zork. All in all, a jagged frontier that defies a simple sarcastic characterization.

girvo 59 days ago

Very kiki, not very bouba, as Aphyr rightfully stated.

seanmcdirmid 60 days ago

You can code your prompts to read and write an external world model on the side. This is what most people do who are seriously doing games with LLMs.

stingraycharles 60 days ago

What do you mean by this? What is this world model, and what does it capture?

seanmcdirmid 60 days ago

You keep a running document called "state of the world". On every turn, you read this document in (as context), use it to help compute what happens, and, based on what happens, create an updated "state of the world" document. You track important details so your LLM is consistent from turn to turn.

If you're doing an RPG, which I guess is where this is more obvious, you track the player and enemy positions, their health, their moods and perhaps top thoughts, and the state of important inanimate objects. If you break down the door, you update the door's state in the document. This is in contrast to just giving the LLM the previous turns and hoping it realizes the door is broken down later (just by statistical completion).
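
A minimal sketch of that loop in Python, in case it helps make the idea concrete. Everything here is an assumption for illustration: `play_turn`, the JSON reply shape (`narration`, `state_updates`), and the `llm` callable are hypothetical names, and the model call is stubbed out rather than tied to any real API. The point is only that the state document is read in, used, and rewritten on every turn.

```python
import json
from typing import Callable


def play_turn(world: dict, player_action: str,
              llm: Callable[[str], str]) -> tuple[str, dict]:
    """Run one turn: show the model the current state, get back updates.

    `llm` is a hypothetical callable (prompt in, text out); swap in
    whatever client you actually use.
    """
    prompt = (
        "STATE OF THE WORLD (JSON):\n"
        + json.dumps(world, indent=2, sort_keys=True)
        + "\n\nPLAYER ACTION: " + player_action
        + '\n\nReply with JSON: {"narration": "...", "state_updates": {...}}'
    )
    reply = json.loads(llm(prompt))
    # Shallow merge: updated keys overwrite, untouched keys carry forward.
    new_world = {**world, **reply.get("state_updates", {})}
    return reply["narration"], new_world


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return json.dumps({
        "narration": "The door splinters.",
        "state_updates": {"door": "broken"},
    })


world = {"door": "closed", "player_hp": 10}
narration, world = play_turn(world, "break down the door", fake_llm)
```

After this turn the document records the door as broken, so the next prompt carries that fact explicitly instead of relying on the model to infer it from the transcript.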

Schlagbohrer 59 days ago

I would love to see consistent-world-state-capturing more integrated into, for example, SillyTavern.

mnky9800n 60 days ago

We should talk. I sent you an email.

WarmWash 60 days ago

The open models only give the SOTA models a run for their money on gameable benchmarks. On the semi-private ARC-AGI 2 sets they do absolutely awfully (<10%, while SOTA is at ~80%).

It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.

roenxi 60 days ago

Have the open models been tried? When I look at the leaderboard [0], the only Qwen model I see is 235B-A22B. I wouldn't expect an MoE model to do particularly well. From what I've seen (thinking mainly of a leaderboard trying to measure EQ [1]), MoE models are at a distinct disadvantage to regular models when it comes to complex tasks that aren't software benchmark targets.

[0] https://arcprize.org/leaderboard

[1] https://eqbench.com/index.html

WarmWash 60 days ago

There is GLM 5 and kimi 2.5 (which gets 11.8%, but I digress)

CamperBob2 60 days ago

Actually the Zorks weren't deterministic, especially Zork II. The Wizard could F you over pretty badly if he appeared at an inopportune time.

mnky9800n 59 days ago

I feel like you are being pedantic. Very few parts of Zork are not static. Yes, the thief shows up randomly, but that’s not the main point of the game.

CamperBob2 59 days ago

It is not the least bit pedantic. Games were meaner back then. If you're on a time (turn)-limited section of the game, or in a vulnerable spot like the volcano, random encounters with the wizard could render the game unwinnable without dying, which would completely wreck a benchmark. Same for the thief in Zork 1. If he randomly steals your light source, you're done for. Or if the RNG dictates that you lose the fight with the troll.

Can't recall anything like that in Zork 3. (Edit: apparently you could get shot randomly when using the time machine in the Royal Museum.)

Schlagbohrer 59 days ago

Was that using an RNG? Or is the entire game deterministic?

CamperBob2 59 days ago

It used an RNG. The usual practice back then was to spin a counter while waiting for keypresses, so that might affect the question when dealing with an external harness, I suppose.
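
A toy model of that, in Python, purely to show why it matters for a harness. This is not Zork's actual code: `spin_until_keypress`, `troll_fight_outcome`, the 16-bit counter width, and the 50% win chance are all made up for illustration. The counter value when the key arrives becomes the seed, so identical input timing gives an identical outcome.

```python
import random


def spin_until_keypress(iterations_waited: int, modulus: int = 65536) -> int:
    """Model a counter that ticks once per polling loop until a key arrives.

    The 16-bit wraparound is an assumption, not a documented detail.
    """
    return iterations_waited % modulus


def troll_fight_outcome(seed: int) -> bool:
    """Hypothetical random event: True means the player wins the fight."""
    rng = random.Random(seed)
    return rng.random() < 0.5


# A human types at slightly different speeds, so the seed varies per run.
# A scripted harness that always waits the same number of iterations
# (or sets the seed directly) gets the same result every time.
seed_a = spin_until_keypress(1234)
seed_b = spin_until_keypress(1234)
same_result = troll_fight_outcome(seed_a) == troll_fight_outcome(seed_b)
```

So whether a run is reproducible depends on whether the harness controls the seed or the input timing, not only on the game logic.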
