by philipbjorge
604 days ago
I work in this space, and Claude's ability to count pixels and interact with a screen using precise coordinates seems like a genuinely useful innovation that I expect will improve on existing approaches.

Existing approaches tend to involve drawing labeled bounding boxes around interactive elements and then asking the LLM to provide a tool call like `click('A12')`, where A12 maps back to the underlying HTML element and we perform some sort of Selenium/JS action. Using heuristics to draw those bounding boxes is tricky. Even performing the correct action can be tricky, since the click handlers might be attached to a different DOM element.

Avoiding this remapping from a visual element to an HTML element, and instead working with high-level operations like `click(x, y)` or `type("foo")` directly on the screen, will probably be more effective at automating use cases.

That being said, providing HTML to the LLM as context does tend to improve performance on top of just visual inference right now.

So I dunno... I'm more optimistic about Claude's approach and am very excited about it... especially if visual inference continues to improve.
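To make the remapping concrete, here's a rough Python sketch of the two styles. Everything in it (`Element`, `label_elements`, `resolve_label`, `center`, the selectors) is made up for illustration and isn't taken from any real tool. The point is just that the label route needs a lookup table that stays in sync with the DOM, while the coordinate route only needs pixels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    # Hypothetical record for one interactive element found on the page.
    selector: str                    # CSS selector for the DOM node (assumed)
    box: tuple[int, int, int, int]   # (left, top, width, height) in pixels


def label_elements(elements):
    """Assign labels A1, A2, ... that get drawn on the screenshot."""
    return {f"A{i}": el for i, el in enumerate(elements, start=1)}


def resolve_label(labels, label):
    """Label route: map click('A2') back to a selector for Selenium/JS."""
    return labels[label].selector


def center(element):
    """Coordinate route: the (x, y) a click(x, y) call would target."""
    left, top, width, height = element.box
    return (left + width // 2, top + height // 2)


elements = [
    Element("#login", (10, 20, 100, 40)),
    Element("#signup", (10, 80, 100, 40)),
]
labels = label_elements(elements)
```

With this, `resolve_label(labels, "A2")` gives back the selector to act on, and that's where things go wrong if the click handler actually lives on a different node. `center(elements[0])` is just a point on the screen, so there's nothing to remap.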
One very subtle advantage of doing HTML analysis is that you can cut out a decent number of LLM calls by doing static analysis of the page.
For example, you don't need to click on a dropdown to understand the options behind it, or scroll down a page to find a button to click.
Certainly, as LLMs get cheaper, the extra LLM calls will matter less (similar to what we're seeing with solar panels, where the cost of the panel is now less than the cost of labour, the reverse of the preceding decade).
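The dropdown/button example can be sketched in a few lines of Python using only the standard library's `html.parser`. This is a minimal sketch, not how any particular agent does it: `PageScanner` and `scan` are hypothetical names, and it assumes native `<select>` and `<button>` elements in the raw HTML.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect dropdown options and button texts without any clicks."""

    def __init__(self):
        super().__init__()
        self.dropdowns = {}      # select name/id -> list of option texts
        self.buttons = []        # button texts, in document order
        self._select = None      # key of the <select> currently open
        self._in_option = False
        self._in_button = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            key = attrs.get("name") or attrs.get("id")
            self._select = key or f"select{len(self.dropdowns)}"
            self.dropdowns[self._select] = []
        elif tag == "option" and self._select is not None:
            self._in_option = True
            self.dropdowns[self._select].append("")
        elif tag == "button":
            self._in_button = True
            self.buttons.append("")

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
            self._in_option = False
        elif tag == "option":
            self._in_option = False
        elif tag == "button":
            self._in_button = False

    def handle_data(self, data):
        if self._in_option and self._select is not None:
            self.dropdowns[self._select][-1] += data
        elif self._in_button:
            self.buttons[-1] += data


def scan(html):
    """Return (dropdown options by name, button texts) for an HTML string."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    dropdowns = {k: [o.strip() for o in v] for k, v in scanner.dropdowns.items()}
    return dropdowns, [b.strip() for b in scanner.buttons]
```

Calling `scan(page_html)` once gives you every option and every button, including ones far below the fold, with zero screenshots and zero LLM calls. The catch is that custom JS dropdowns built out of `<div>`s won't show up this way, which is where the visual approach earns its keep.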