by maccam912 958 days ago
I've been playing with a similar idea of using screenshots and actions from GPT-4 Vision for browsing. After trying and failing to overlay info on the screenshot, I ended up just getting the accessibility tree from Playwright and sending that along as text so the model would know what options it had for interaction. In my case it seemed to work better. I see the creator is here and has a list of future ideas; maybe add this to the list if you think it's a good idea?
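For anyone curious what "sending the tree along as text" might look like, here is a rough sketch, not the commenter's actual code. It assumes the accessibility tree arrives as a nested dict with `role`, `name`, and `children` keys (roughly the shape a Playwright accessibility snapshot produces). The function name `flatten_ax_tree` and the sample tree are made up for illustration.

```python
from typing import Any


def flatten_ax_tree(node: dict[str, Any], depth: int = 0) -> list[str]:
    """Turn a nested accessibility node into indented 'role "name"' lines."""
    role = node.get("role", "unknown")
    name = node.get("name", "")
    # Only append the quoted name when the node actually has one.
    line = "  " * depth + role + (f' "{name}"' if name else "")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, depth + 1))
    return lines


# Hand-written sample tree (hypothetical page, not captured from a real site).
sample_tree = {
    "role": "WebArea",
    "name": "Example",
    "children": [
        {"role": "link", "name": "Sign in"},
        {"role": "textbox", "name": "Search"},
        {"role": "button", "name": "Go"},
    ],
}

# This text is what would be placed in the prompt alongside the task.
prompt_text = "\n".join(flatten_ax_tree(sample_tree))
print(prompt_text)
```

The indented text lists each interactive element once, which is usually far shorter than the page's HTML.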
2 comments

Cool, that’s a solid idea. I was trying to use only visual data, but this could make the agent a lot more powerful. I’ll try this really soon.
Probably better to capture all the content and not just what fits on one screen. Most pages should fit as text (or HTML?) in the new extended token window.
Better watch token costs. Per-token costs are lower now, but even so a full context load still costs almost $4.
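A back-of-the-envelope sketch of that arithmetic, in case it helps. The 128,000-token window and the $0.03 per 1K tokens rate below are assumptions picked only to illustrate how a figure near $4 could arise. They are not quoted from the thread or from any price list, so plug in current numbers before relying on it.

```python
def estimate_prompt_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost in USD of sending `tokens` prompt tokens at a flat per-1K rate."""
    return tokens / 1000 * usd_per_1k_tokens


# Assumed values for illustration only (not real pricing data).
ASSUMED_CONTEXT_TOKENS = 128_000
ASSUMED_USD_PER_1K = 0.03

full_context_cost = estimate_prompt_cost(ASSUMED_CONTEXT_TOKENS, ASSUMED_USD_PER_1K)
print(f"full context: ${full_context_cost:.2f}")
```

An agent that takes many steps per task pays that kind of cost on every step, so a compact page representation matters more than a single-call estimate suggests.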