Hacker News
by gavmor 1 day ago
A regular LLM acts as a "policy," mapping a current state to a specific action (states → actions). Their new LLM acts as a "world model," mapping a current state and a chosen action to a predicted future state ((states, actions) → subsequent states). Instead of deciding "what to do," its explicit objective is to predict the exact environment observation that will result from the interaction history and the agent's current action.
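To make the two signatures concrete, here is a minimal sketch in Python. The type aliases and both toy functions are my own illustration, not anything from the model itself; a real policy or world model would be an LLM call rather than a string rule.

```python
from typing import Callable

State = str   # assumed representation, e.g. an HTML document
Action = str  # assumed representation, e.g. "click button"

# Policy: states -> actions
Policy = Callable[[State], Action]
# World model: (states, actions) -> subsequent states
WorldModel = Callable[[State, Action], State]


def toy_policy(state: State) -> Action:
    # Hypothetical rule: click the button if one is on screen.
    return "click button" if "<button>" in state else "noop"


def toy_world_model(state: State, action: Action) -> State:
    # Hypothetical dynamics: clicking swaps the button for a confirmation.
    if action == "click button":
        return state.replace("<button>Save</button>", "<p>Saved</p>")
    return state


screen = "<html><button>Save</button></html>"
action = toy_policy(screen)                    # decides what to do
next_screen = toy_world_model(screen, action)  # predicts what happens
```

The policy answers "what should I do here?", while the world model answers "what will the screen look like if I do this?".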

I assumed at first that it was trained on synthetic data, but they actually went and deployed real physical hosts and virtual machines (e.g. Ubuntu, macOS, and Android) and browsers. They ran agentic systems on these continuously and recorded the actual, real-world interactions.

So it's an LLM that infers the next state, or outcome, as structured data, e.g. literal HTML code, UI view hierarchies, or accessibility trees.

1 comments

So, if I'm reading this correctly, whereas a regular LLM would, given a prompt to edit a file, infer a sed call, this "world" model infers the resulting contents of the file.
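A toy illustration of that contrast, assuming a made-up config file and a made-up edit. The `simulate_substitution` helper is only a stand-in for what the world model would predict; it is not how the actual model works.

```python
import re

file_before = "port = 8080\nhost = localhost\n"

# Policy-style output: an action to run (here, a sed command as a string).
policy_output = "sed -i 's/port = 8080/port = 9090/' config.ini"


def simulate_substitution(text: str, pattern: str, replacement: str) -> str:
    # Stand-in for the world model's prediction: replace the first match
    # on each line, like sed's s/// without the g flag.
    lines = text.split("\n")
    return "\n".join(re.sub(pattern, replacement, line, count=1) for line in lines)


# World-model-style output: the file contents after the action.
world_model_output = simulate_substitution(file_before, r"port = 8080", "port = 9090")
print(world_model_output)
```

Same request, two different kinds of answer: one is a command, the other is the state you would see after running it.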
Here's the demo: https://docs.qwenlm.ai/resources/mlu56_demo.html

Here's the description of the world model prompt for the web domain: "A precise GUI state simulator — given the current screen (as HTML) and a user action, predicts the exact next screen as a complete, self-contained HTML document." (You can click the world model prompt box to expand it and see the full prompt.)

So the world model generates the current state (an HTML document), an agent tells it what action it wants to perform, and the world model generates the next state (another HTML document).
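That loop can be sketched roughly as follows. `rollout` and `toy_world_model` are hypothetical names of mine, and the string replacement stands in for the actual LLM call, whose interface I'm not assuming anything about.

```python
from typing import Callable, List


def rollout(initial_state: str, actions: List[str],
            world_model: Callable[[str, str], str]) -> List[str]:
    # Feed each action plus the latest predicted screen back into the model.
    states = [initial_state]
    for action in actions:
        states.append(world_model(states[-1], action))
    return states


def toy_world_model(html: str, action: str) -> str:
    # Hypothetical stand-in for the LLM: toggles a menu's hidden attribute.
    if action == "open menu":
        return html.replace("<nav hidden>", "<nav>")
    if action == "close menu":
        return html.replace("<nav>", "<nav hidden>")
    return html


trajectory = rollout(
    "<body><nav hidden>Menu</nav></body>",
    ["open menu", "close menu"],
    toy_world_model,
)
for screen in trajectory:
    print(screen)
```

Every state after the first is a prediction, so errors can compound over a long rollout.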

The other domains are similar, but with domain-specific nuance.

And a world model is useful for ... action space search which would require prediction?
It should improve agents' action selection by allowing them to evaluate actions' effects before performing them.

An agent using only a regular LLM has no real way to predict the results of its actions. It has to just take an action based on its training data and hope it's the right one. With a world model like this, it could do a second pass before each action to catch mistakes.
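A rough sketch of what that second pass could look like, under my own assumptions: the agent proposes a few candidate actions, asks the world model what each would lead to, and picks the one whose predicted state scores best. The lookup table and the scoring function are toy placeholders.

```python
from typing import Callable, List


def choose_action(state: str, candidates: List[str],
                  world_model: Callable[[str, str], str],
                  score: Callable[[str], float]) -> str:
    # Predict the outcome of every candidate before committing to one.
    predicted = {a: world_model(state, a) for a in candidates}
    return max(candidates, key=lambda a: score(predicted[a]))


# Hypothetical predictions, standing in for world model calls.
PREDICTIONS = {
    ("cart: empty", "click 'Delete account'"): "account deleted",
    ("cart: empty", "click 'Add to cart'"): "cart: 1 item",
}


def toy_world_model(state: str, action: str) -> str:
    return PREDICTIONS.get((state, action), state)


def goal_score(state: str) -> float:
    # Assumed goal: end up with one item in the cart.
    return 1.0 if state == "cart: 1 item" else 0.0


best = choose_action(
    "cart: empty",
    ["click 'Delete account'", "click 'Add to cart'"],
    toy_world_model,
    goal_score,
)
print(best)
```

The destructive action gets filtered out without ever being executed against the real environment, which is the appeal here.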

I don't know if this actually delivers yet, but if it does it might help make agents more usable.

Yeah, the fun part is the lookahead search, and here we are back in classical action-space fanout search, except I guess emulated in an LLM.
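For what it's worth, the classical version is easy to sketch: a depth-limited breadth-first search where the transition function is the world model. Everything below (the function name, the toy wizard pages, the transition table) is my own assumption for illustration; in practice each `world_model` call would be an LLM inference.

```python
from collections import deque
from typing import Callable, List, Optional


def lookahead_search(start: str, actions: List[str],
                     world_model: Callable[[str, str], str],
                     is_goal: Callable[[str], bool],
                     max_depth: int) -> Optional[List[str]]:
    # Breadth-first fanout over predicted states, up to max_depth actions.
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if is_goal(state):
            return plan
        if len(plan) == max_depth:
            continue
        for action in actions:
            nxt = world_model(state, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None


# Hypothetical three-page wizard standing in for predicted screens.
TRANSITIONS = {
    ("step1", "next"): "step2",
    ("step2", "next"): "done",
    ("step2", "back"): "step1",
}


def toy_world_model(state: str, action: str) -> str:
    return TRANSITIONS.get((state, action), state)


plan = lookahead_search("step1", ["back", "next"], toy_world_model,
                        lambda s: s == "done", max_depth=3)
print(plan)
```

The fanout grows with the number of candidate actions per step, so with an LLM as the transition function the cost per node is the real constraint.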