by _crowecawcaw
22 days ago
I also experimented with vision/screenshot-based computer-use tools for similar use cases, but the results were inconsistent. LLMs had trouble getting precise pixel coordinates from a screenshot to move the mouse, and the screenshots cost extra tokens. I had a lot more success using accessibility APIs in place of screenshots + input simulation, since accessibility data is easier for LLMs to process. The accessibility functionality is now released as a separate library for building automation tooling: https://xa11y.dev/
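
To illustrate the general idea (this is only a rough sketch, not the library's API): an accessibility tree can be flattened into a few lines of text, and the model can then refer to an element by ID instead of guessing pixel coordinates. The `Node` class, `serialize`, `find`, and the sample tree below are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical accessibility node; not the data model of any real library."""
    id: int
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def serialize(node, depth=0):
    """Render the tree as indented text lines that can go straight into a prompt."""
    label = f' "{node.name}"' if node.name else ""
    lines = [f'{"  " * depth}[{node.id}] {node.role}{label}']
    for child in node.children:
        lines.extend(serialize(child, depth + 1))
    return lines


def find(node, node_id):
    """Resolve an element ID chosen by the model back to a node, or None."""
    if node.id == node_id:
        return node
    for child in node.children:
        hit = find(child, node_id)
        if hit is not None:
            return hit
    return None


# Made-up sample tree standing in for whatever the OS accessibility API returns.
tree = Node(1, "window", "Settings", [
    Node(2, "button", "Save"),
    Node(3, "textfield", "Username"),
])

prompt_text = "\n".join(serialize(tree))
print(prompt_text)
```

The model would answer with something like "click element 2", and `find(tree, 2)` resolves that to the Save button without any coordinate estimation.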