HN Mirror


by philipbjorge 604 days ago

I work in this space and Claude's ability to count pixels and interact with a screen using precise coordinates seems like a genuinely useful innovation that I expect will improve upon existing approaches.

Existing approaches tend to involve drawing marked bounding boxes around interactive elements and then asking the LLM to provide a tool call like `click('A12')` where A12 remaps to the underlying HTML element and we perform some sort of Selenium/JS action. Using heuristics to draw those bounding boxes is tricky. Even performing the correct action can be tricky as it might be that click handlers are attached to a different DOM element.
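For anyone who hasn't seen the label-based approach, here is a rough sketch of the remapping step. The label table, the selectors in it, and the `resolve_click` helper are all hypothetical names made up for illustration; a real system would build the table while drawing the bounding boxes and then hand the selector to Selenium or injected JS.

```python
import re

# Hypothetical table built while drawing the bounding boxes:
# each visible label maps back to a selector for the underlying element.
LABEL_TO_SELECTOR = {
    "A11": "#search-input",
    "A12": "button[type=submit]",
}

# Matches a tool call of the form click('A12') as emitted by the LLM.
TOOL_CALL = re.compile(r"^click\('([A-Z]\d+)'\)$")


def resolve_click(tool_call, labels=LABEL_TO_SELECTOR):
    """Turn the LLM's click('<label>') call into the selector to act on."""
    match = TOOL_CALL.match(tool_call.strip())
    if match is None:
        raise ValueError(f"unrecognized tool call: {tool_call!r}")
    label = match.group(1)
    if label not in labels:
        raise KeyError(f"no element was labeled {label}")
    return labels[label]


selector = resolve_click("click('A12')")
print(selector)
```

The fragile parts are exactly the ones outside this snippet: deciding which elements get a label, and whether the selector you stored is the element that actually carries the click handler.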

Avoiding this remapping from a visual label to an HTML element, and instead working with high-level operations like `click(x, y)` or `type("foo")` directly on the screen, will probably be more effective at automating use cases.
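A minimal sketch of what the screen-level interface could look like, assuming the model emits coordinates in the same pixel space as the screen. `Click`, `Type`, and `RecordingScreen` are hypothetical names; the backend here only records events, where a real one would drive the OS mouse and keyboard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class Type:
    text: str


@dataclass
class RecordingScreen:
    """Stand-in for a real input backend; it only records events."""
    width: int
    height: int
    events: list = field(default_factory=list)

    def perform(self, action):
        if isinstance(action, Click):
            # Reject coordinates that fall outside the visible screen.
            if not (0 <= action.x < self.width and 0 <= action.y < self.height):
                raise ValueError(f"click outside screen: {action}")
            self.events.append(("click", action.x, action.y))
        elif isinstance(action, Type):
            self.events.append(("type", action.text))
        else:
            raise TypeError(f"unsupported action: {action!r}")


screen = RecordingScreen(width=1280, height=800)
screen.perform(Click(640, 400))
screen.perform(Type("foo"))
print(screen.events)
```

Note there is no label table and no DOM lookup anywhere: the only thing that has to be right is the coordinate itself.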

That being said, providing HTML to the LLM as context does tend to improve performance on top of just visual inference right now.

So I dunno... I'm more optimistic about Claude's approach and am very excited about it... especially if visual inference continues to improve.

3 comments

suchintan 604 days ago

Agreed. In the short term (X months) I expect HTML distillation + giving text to LLMs to win out, but in the long term (Y years) screenshot-only + pixels will definitely be the more "scalable" approach.

One very subtle advantage of doing HTML analysis is that you can cut out a decent number of LLM calls by doing static analysis of the page.

For example, you don't need to click on a dropdown to understand the options behind it, or scroll down on a page to find a button to click.
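A quick sketch of the dropdown case using only Python's standard `html.parser`, assuming the dropdown is a plain `<select>` element (custom JS dropdowns would need more work). `SelectOptionParser` and `dropdown_options` are hypothetical helper names.

```python
from html.parser import HTMLParser


class SelectOptionParser(HTMLParser):
    """Collect the visible option labels of every <select> on a page."""

    def __init__(self):
        super().__init__()
        self.options = {}       # select name (or id) -> list of option labels
        self._current = None    # key of the <select> currently being parsed
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self._current = attrs.get("name") or attrs.get("id") or "unnamed"
            self.options[self._current] = []
        elif tag == "option" and self._current is not None:
            self._in_option = True
            self.options[self._current].append("")

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None
        elif tag == "option":
            self._in_option = False

    def handle_data(self, data):
        # Only text inside an <option> counts as a label.
        if self._in_option and self._current is not None:
            self.options[self._current][-1] += data.strip()


def dropdown_options(html):
    parser = SelectOptionParser()
    parser.feed(html)
    parser.close()
    return parser.options


page = """
<select name="country">
  <option value="us">United States</option>
  <option value="ca">Canada</option>
</select>
"""
print(dropdown_options(page))
```

The options come straight out of the markup, so no click and no extra screenshot round trip is needed to learn what the dropdown contains.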

Certainly, as LLMs get cheaper the extra LLM calls will matter less (similar to what we're seeing with solar panels, where the cost of the panel is now less than the cost of labour, the reverse of the preceding decade).


drothlis 604 days ago

> Claude's ability to count pixels and interact with a screen using precise coordinates

I guess you mean its "Computer use" API that can (if I understand correctly) send mouse clicks at specific coordinates?

I got excited thinking Claude can finally do accurate object detection, but alas no. Here's its output:

> Looking at the image directly, the SPACE key appears near the bottom left of the keyboard interface, but I cannot determine its exact pixel coordinates just by looking at the image. I can see it's positioned below the letter grid and appears wider than the regular letter keys, but I apologize - I cannot reliably extract specific pixel coordinates from just viewing the screenshot.

This is 3.5 Sonnet (their most current model).

And they explicitly call out spatial reasoning as a limitation:

> Claude’s spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.

--https://docs.anthropic.com/en/docs/build-with-claude/vision#...

Since 2022 I've occasionally dipped in and tested this use case with the latest models, but I haven't seen much progress on spatial reasoning. The multi-modality has been a neat addition though.


philipbjorge 598 days ago

They report that they trained the model to count pixels, and judging by the accurate mouse clicks coming out of it, that seems to be true for at least some code path.

> When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical.


wintonzheng 603 days ago

Curious: what use cases do you use to test the spatial reasoning ability of these models?


makestuff 604 days ago

I don't use LLMs that often, but I recently used Claude Sonnet and was more impressed than I was with ChatGPT for similar AWS CDK questions.

In your opinion, is Claude in the lead now? Or does it still really just depend on what use case/question you are trying to solve?
