Hacker News
by _crowecawcaw 22 days ago
I also experimented with vision/screenshot-based computer-use tools for similar use cases but got inconsistent results. LLMs had trouble extracting precise pixel coordinates from a screenshot to move the mouse, and the screenshots cost extra tokens. I had a lot more success using accessibility APIs in place of screenshots + input simulation, since accessibility data is easier for LLMs to process. The accessibility functionality is now released as a separate library for building automation tooling: https://xa11y.dev/
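
A rough sketch of why accessibility data can be easier for a model than pixels: the UI tree is flattened into a few lines of text with numeric ids, and the model answers with an id instead of guessing coordinates. This is not xa11y's API. The `Node` class, `serialize`, `resolve`, and the `click <id>` reply format are all hypothetical, and the tree here is an in-memory stand-in for what a platform accessibility API would return.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical accessibility node; real platform APIs expose richer data."""
    role: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def serialize(root: Node) -> Tuple[str, Dict[int, Node]]:
    """Flatten the tree into indented text lines, each tagged with a numeric id."""
    lines: List[str] = []
    index: Dict[int, Node] = {}

    def walk(node: Node, depth: int) -> None:
        # Ids are assigned in depth-first order, starting at 1.
        node_id = len(index) + 1
        index[node_id] = node
        label = f' "{node.name}"' if node.name else ""
        lines.append(f"{'  ' * depth}[{node_id}] {node.role}{label}")
        for child in node.children:
            walk(child, depth + 1)

    walk(root, 0)
    return "\n".join(lines), index


def resolve(action: str, index: Dict[int, Node]) -> Node:
    """Map a model reply such as 'click 3' back to the node it refers to."""
    verb, raw_id = action.split()
    if verb != "click":
        raise ValueError(f"unsupported action: {verb}")
    return index[int(raw_id)]


# Stand-in for a tree read from an accessibility API (assumed shape).
window = Node("window", "Settings", [
    Node("checkbox", "Enable sync"),
    Node("group", children=[Node("button", "Cancel"), Node("button", "Save")]),
])

text, index = serialize(window)
print(text)                               # compact text that goes into the prompt
print(resolve("click 5", index).name)     # the element the model picked, by id
```

The prompt carries one short line per element rather than an image, and the reply names an element exactly, so there is no coordinate to get slightly wrong.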
1 comment

cool! thank you for sharing - will check it out