| HN Mirror



	by maccam912 958 days ago
	I've been playing with a similar idea of using screenshots and actions from GPT-4 Vision for browsing. After trying and failing to overlay info on the screenshot, I ended up just getting the accessibility tree from Playwright and sending that along as text, so the model would know what options it had for interaction. In my case it seemed to work better. I see the creator is here and has a list of future ideas; maybe add this to the list if you think it's a good idea?
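	A minimal sketch of the "accessibility tree as text" idea follows. It assumes the tree arrives as nested dicts with `role`, `name`, and `children` keys, which is the general shape Playwright's accessibility snapshot has returned; the `flatten_tree` helper, the `INTERACTIVE_ROLES` set, and the sample tree are hypothetical and not taken from the project under discussion.

```python
from typing import Any

# Hypothetical set of roles the model is allowed to act on.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def flatten_tree(node: dict[str, Any], depth: int = 0) -> list[str]:
    """Turn a nested accessibility-tree dict into indented text lines.

    Interactive nodes are marked with '*', everything else with '-'.
    """
    role = node.get("role", "")
    name = node.get("name", "")
    marker = "*" if role in INTERACTIVE_ROLES else "-"
    lines = [f"{'  ' * depth}{marker} {role}: {name!r}"]
    for child in node.get("children", []):
        lines.extend(flatten_tree(child, depth + 1))
    return lines


# Hand-written sample tree (an assumption, for illustration only).
sample_tree = {
    "role": "WebArea",
    "name": "Example",
    "children": [
        {"role": "heading", "name": "Welcome"},
        {"role": "textbox", "name": "Search"},
        {"role": "button", "name": "Go"},
    ],
}

# Text that could be sent to the model alongside (or instead of) a screenshot.
prompt_text = "\n".join(flatten_tree(sample_tree))
```

	The flattened text lists every element with its role and label, so the model can name the element it wants to act on rather than guess coordinates from pixels.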

2 comments

ishan0102 958 days ago

Cool, that’s a solid idea. I was trying to use only visual data, but this could make the agent a lot more powerful. I’ll try it really soon.


manmal 958 days ago

Probably better to capture all the content and not just what fits on one screen. Most pages should fit as text (or HTML?) in the new extended token window.


arbuge 958 days ago

Better watch token costs. Per-token prices are lower now, but even so a full context load still costs almost $4.
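A rough way to check a figure like this is to multiply token counts by per-token prices. The sketch below is only that arithmetic: the `estimate_cost` helper is hypothetical, and the prices in the usage lines are placeholders, not any provider's real rates.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Return the estimated cost in dollars of a single request."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder numbers (assumptions): 100,000 input tokens at $0.02 per 1k
# and 1,000 output tokens at $0.05 per 1k.
cost = estimate_cost(100_000, 1_000, 0.02, 0.05)
print(f"${cost:.2f}")
```

Because an agent resends the page on every step, the per-request cost multiplies by the number of steps, which is why a full-page text dump can get expensive quickly.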
