Hacker News
by shnkr 828 days ago
The moment I saw "vision" in the title I knew what was going on. It was first demoed [0] by AI Jason around 4 months back. Is it any different?

[0] https://m.youtube.com/watch?v=IXRkmqEYGZA


Love this video

> self-operating-computer

This is quite different from https://github.com/OthersideAI/self-operating-computer

Self-operating-computer uses pixel mapping to control your computer. It's a very good approach, but it's extremely unreliable: GPT-4V frequently hallucinates pixel outputs, causing it to miss interactions or enter fail-loops.

> The approach by AI Jason

AI Jason is using image-only methods to interact with the browser. This is a great first step, but the approach tends to be rife with hallucinations and errors. We do DOM parsing in addition to image analysis to help GPT-4V correlate information in the image with the interactable elements in the DOM. This dramatically boosts its ability to perform the same task over and over again reliably (which proved impossible with the image-only approach).
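To make the DOM-parsing idea concrete, here is a minimal sketch of one way to pull interactable elements out of a page and give each a numeric id that could be listed next to the screenshot in a prompt. This is not the project's actual code: the tag list, `InteractableCollector`, `collect_interactables`, and `describe_for_prompt` are all hypothetical names, and it uses only Python's standard-library `html.parser`.

```python
from html.parser import HTMLParser

# Tags treated as interactable in this sketch (an assumption, not an exhaustive list).
INTERACTABLE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractableCollector(HTMLParser):
    """Collects interactable elements and assigns each a numeric id."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTABLE_TAGS:
            return
        attr_map = dict(attrs)
        self.elements.append({
            "id": len(self.elements),      # position among interactable elements
            "tag": tag,
            "type": attr_map.get("type"),  # None when the attribute is absent
            "name": attr_map.get("name"),
        })


def collect_interactables(html):
    """Return the interactable elements found in an HTML string, in document order."""
    parser = InteractableCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


def describe_for_prompt(elements):
    """Render one text line per element, suitable for pasting into a model prompt."""
    return [
        f"[{el['id']}] <{el['tag']}> type={el['type']} name={el['name']}"
        for el in elements
    ]
```

The model can then answer with an element id instead of pixel coordinates, so a hallucinated coordinate can no longer miss the target.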

Nice. I was looking for simpler hacks, as V didn't scale for me. Later I couldn't find the time and this got back-burnered.

Interesting concept for problem solving though. Congrats!

Thanks! We definitely experimented with V only (that's the dream), but there's too much context missing:

1. What's behind a select option? You don't know until you click it, which means you need another iteration. This sucks.
2. How do you consistently correlate things in the images to actual actions (i.e. upload a file to a file input, click on a button, insert a date into a date input)? Having the additional HTML tag information dramatically improves the action selection process (click vs. upload vs. type).
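Both points can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the project's implementation: `list_options` and `choose_action` are hypothetical helpers, the action names ("click", "upload", "type", "select") are made up for the example, and the set of click-style input types is a simplification.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collects the visible text of every <option> in a snippet of HTML."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option:
            self.options[-1] += data


def list_options(select_html):
    """Point 1: read the choices from the markup, without clicking the select."""
    parser = OptionCollector()
    parser.feed(select_html)
    parser.close()
    return [text.strip() for text in parser.options]


# Input types that behave like buttons or toggles (a simplifying assumption).
CLICK_INPUT_TYPES = {"submit", "button", "reset", "image", "checkbox", "radio"}


def choose_action(tag, attrs):
    """Point 2: pick an action kind from the tag and its attributes."""
    tag = tag.lower()
    input_type = (attrs.get("type") or "text").lower()
    if tag == "select":
        return "select"
    if tag == "textarea":
        return "type"
    if tag == "input":
        if input_type == "file":
            return "upload"
        if input_type in CLICK_INPUT_TYPES:
            return "click"
        return "type"  # text, date, email, and other typed inputs
    return "click"     # buttons, links, and anything else
```

For example, `choose_action("input", {"type": "file"})` returns `"upload"` and `choose_action("input", {"type": "date"})` returns `"type"`, a distinction that is hard to make from a screenshot alone.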