| HN Mirror


by theredsix 144 days ago

The difference is that we make browser use turn-based and return a single structured result per action.

With most other tools, the model is interacting with a live browser and effectively has to reason through a stream of low-level events while the page keeps changing. We instead freeze the page, let the model request one action, execute it, allow all resulting browser events to play out, then freeze again and return one bundled response with everything that happened plus the new stable page state.
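To make that loop concrete, here is a minimal toy sketch of one turn. Everything in it is an assumption for illustration: `ToyPage`, `ActionResult`, `take_turn`, and the event strings are hypothetical names and do not come from the actual tool or from any real browser API. It only shows the shape of the idea: unfreeze, run one action, drain the resulting events, freeze, and return one bundle.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class ActionResult:
    """One bundled response per action (hypothetical shape)."""
    action: str
    events: List[str]
    page_state: Dict[str, str]


@dataclass
class ToyPage:
    """Stand-in for a browser page; not a real browser API."""
    state: Dict[str, str] = field(default_factory=lambda: {"url": "/login"})
    pending: Deque[str] = field(default_factory=deque)
    frozen: bool = True

    def dispatch(self, action: str) -> None:
        # Pretend that clicking submit triggers a chain of follow-up events.
        if action == "click:submit":
            self.pending.extend(
                ["request:POST /session", "navigate:/home", "dom:settled"]
            )

    def pump(self) -> Optional[str]:
        # Process one queued event and apply its effect to the page state.
        if not self.pending:
            return None
        event = self.pending.popleft()
        if event.startswith("navigate:"):
            self.state["url"] = event.split(":", 1)[1]
        return event


def take_turn(page: ToyPage, action: str) -> ActionResult:
    page.frozen = False      # unfreeze only for the duration of this turn
    page.dispatch(action)    # execute exactly one requested action
    events: List[str] = []
    while (event := page.pump()) is not None:
        events.append(event)  # let every resulting event play out
    page.frozen = True       # freeze again before the model sees anything
    # Return a copy of the state so the bundle is a stable snapshot.
    return ActionResult(action, events, dict(page.state))


page = ToyPage()
result = take_turn(page, "click:submit")
print(result)
```

In this toy version the model would only ever see `result`: the action it asked for, the events that followed, and the page state after things settled. A real implementation would also need some rule for deciding when the page counts as stable, which this sketch sidesteps by using an empty event queue.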

So the model isn’t chasing a moving UI or an event stream. It gets one grounded step at a time. A big part of the performance gain seems to come from bundling everything an action caused into that single response.