Hacker News
by WellingtonWells 207 days ago
I'm kinda curious how a VLM would do -- better spatial reasoning but worse planning? I don't use an AI web browser, but I'd be curious to know what happens if you throw something like OpenAI Atlas at the game's webpage.
1 comment

So there are a couple of papers that try to use LLMs for UI-based enterprise task benchmarks, like WorkArena++ (ServiceNow), where the agent has to solve a few relatively simple enterprise tasks (like creating incident tickets based on criteria the agent has to determine itself). This benchmark in particular had quite low accuracy numbers, especially on the more composite tasks. Curious about the OpenAI Atlas thing too.