by xelia 919 days ago
I wonder if this can be optimized by letting GPT provide multiple instructions per screenshot instead of just one.

For example, in the Twitter screenshot, it could work from just the one image.

1 comment

If you look closely, it actually does give multiple instructions per screenshot! However, it can't get too far, because the screen changes under it. For example, when it starts typing a tweet, the tweet box expands and the send button moves, so it tries to click the button but it's no longer there. It needs to take another screenshot to see where things are, because it's kinda executing those later steps "in the dark".
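
To make that concrete, here's a rough sketch of the idea, not the project's actual code: run the planned steps in order, but bail out as soon as the screen no longer matches what the plan was based on, and hand the leftover steps back so they can be re-planned from a fresh screenshot. The names `run_until_stale`, `screen_hash`, `perform`, and the `(kind, target)` action shape are all made up for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical action shape: (kind, target), e.g. ("click", "send").
Action = Tuple[str, str]


def run_until_stale(
    actions: List[Action],
    screen_hash: Callable[[], str],
    perform: Callable[[Action], None],
) -> List[Action]:
    """Run planned actions in order and stop once the screen has changed.

    screen_hash is an assumed callback returning a cheap fingerprint of the
    current screen; perform is an assumed callback that executes one action.
    Returns the actions that were NOT executed, so the caller can take a new
    screenshot and ask the model to re-plan them.
    """
    baseline = screen_hash()  # fingerprint of the screen the plan was made for
    for i, action in enumerate(actions):
        if screen_hash() != baseline:
            # The layout moved under us; the remaining steps are stale.
            return actions[i:]
        perform(action)
    return []
```

With a plan like click box, type text, click send, the first two steps run from one screenshot, and the final click comes back as "stale" once typing has changed the layout.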

We could try to patch in an "interpolation" kind of thing to account for those changes. But I'm also curious to see whether the multi-modal models that are coming out with video support would be able to just "watch the video" in real time. That would be the ultimate solution.