| HN Mirror



	by _crowecawcaw 22 days ago
	I also experimented with vision/screenshot-based computer-use tools for similar use cases, but the results were inconsistent. LLMs had trouble extracting precise pixel coordinates from a screenshot to position the mouse, and the screenshots cost extra tokens. I had much more success replacing screenshots and input simulation with accessibility APIs, since accessibility data is easier for LLMs to process. The accessibility functionality is now released as a separate library for building automation tooling: https://xa11y.dev/
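
	To make the point about accessibility data concrete, here is a rough sketch of how a UI tree might be flattened into compact text for an LLM, so that the model picks an element by id instead of guessing pixel coordinates. The `Node` class and `flatten` helper are hypothetical and written only for illustration; they are not the xa11y API, and a real accessibility tree carries many more attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical accessibility node (illustration only, not the xa11y data model)."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def flatten(root: Node) -> tuple[str, dict[int, Node]]:
    """Render the tree as indented text and index every node by a numeric id."""
    lines: list[str] = []
    index: dict[int, Node] = {}

    def walk(node: Node, depth: int) -> None:
        node_id = len(index)  # ids are assigned in depth-first order
        index[node_id] = node
        label = f' "{node.name}"' if node.name else ""
        lines.append(f"{'  ' * depth}[{node_id}] {node.role}{label}")
        for child in node.children:
            walk(child, depth + 1)

    walk(root, 0)
    return "\n".join(lines), index


# A tiny made-up window: the LLM would see `text` and answer with an id.
window = Node("window", "Settings", [
    Node("button", "Save"),
    Node("group", "", [Node("checkbox", "Dark mode")]),
])
text, index = flatten(window)
print(text)
```

	The model can then reply with something like "click 1", and the tool looks up `index[1]` to act on the element directly, with no coordinate estimation involved.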

1 comment

cool! thank you for sharing - will check it out