Hacker News
by adamsiem 399 days ago
Is anyone using vision models to parse screenshots? QVQ was too slow for me. I'll give this a shot.
2 comments

I used Molmo to parse screenshots and detect the locations of UI elements. See the repo below. I think OmniParser from Microsoft would also work well.

https://github.com/logankeenan/george

https://github.com/microsoft/OmniParser
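
For a rough idea of how a pointing model's answer turns into click coordinates, here is a minimal sketch. It assumes the model replies with tags like `<point x="50.0" y="25.0" alt="...">`, where x and y are percentages of the image width and height. That output shape is an assumption on my part, not something taken from either repo above, and `parse_points` is a hypothetical helper name.

```python
import re

# Assumed tag shape: <point x="50.0" y="25.0" alt="Submit button">...</point>
# x and y are assumed to be percentages (0-100) of the image dimensions.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"\s+alt="([^"]*)"')


def parse_points(text, width, height):
    """Return (label, x_px, y_px) tuples for every point tag found in text."""
    results = []
    for x_pct, y_pct, label in POINT_RE.findall(text):
        # Scale percentage coordinates to pixel coordinates.
        x_px = round(float(x_pct) / 100 * width)
        y_px = round(float(y_pct) / 100 * height)
        results.append((label, x_px, y_px))
    return results


sample = '<point x="50.0" y="25.0" alt="Submit button">Submit button</point>'
print(parse_points(sample, 1920, 1080))
```

The pixel coordinates could then be handed to whatever automation tool drives the mouse. If the model reports coordinates in a different scale, only the scaling step needs to change.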

You might be interested in https://github.com/OpenAdaptAI/OpenAdapt