Freezing the browser at every step is a very good approach. I am also working on an agent browser; it uses wireframe snapshots instead of screenshots to reduce token cost.
https://github.com/agent-browser-io/browser
Your tool's method of returning element references is clever. It should greatly improve how an LLM handles the page components, and greatly reduce token cost.
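
For anyone wondering why element references help, here is a rough sketch of the general idea. It is not how either tool actually works: the `Element` class, the `snapshot` function, and the `e1`, `e2` reference format are all made up for illustration. The page is rendered as a short text listing, each element gets a short reference, and the model can answer with that reference instead of a long selector or pixel coordinates.

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical minimal description of an interactive page element.
    role: str   # e.g. "button", "textbox", "link"
    label: str  # visible text or accessible name


def snapshot(elements):
    """Return (text, refs): a compact text listing and a ref -> element map."""
    refs = {}
    lines = []
    for i, el in enumerate(elements, start=1):
        ref = f"e{i}"  # short reference the model can send back
        refs[ref] = el
        lines.append(f'[{ref}] {el.role} "{el.label}"')
    return "\n".join(lines), refs


# Illustrative page with two elements.
page = [Element("textbox", "Search"), Element("button", "Go")]
text, refs = snapshot(page)
print(text)
# [e1] textbox "Search"
# [e2] button "Go"
```

If the model replies with something like "click e2", the tool only has to look up `refs["e2"]`, so the prompt carries a few short lines rather than a screenshot or a full DOM dump.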