If you look closely, it actually does give multiple instructions per screenshot! It can't get very far, though, because the screen changes under it. For example, when it starts typing a tweet, the tweet box expands and the send button moves, so it tries to click the button but it's no longer there. It has to take another screenshot to find it again, because it's kinda executing those steps "in the dark".
We could try to patch in some kind of "interpolation" to account for the changes. But I'm also curious whether the multi-modal models coming out with video support would be able to just "watch the video" in real time. That would be the ultimate solution.
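To make the "in the dark" problem concrete, here is a rough sketch of a cheaper stopgap than real interpolation: before each click in a batch, grab a fresh frame and check whether the pixels around the target still match the screenshot the plan was made from. Everything here is hypothetical and made up for illustration (`Action`, `region_changed`, `run_batch`, the `grab_frame` and `perform` callbacks, and frames as plain grayscale grids); it is not how the tool actually works.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumption: a frame is a grid of grayscale pixel values (rows of ints).
Frame = Sequence[Sequence[int]]


@dataclass
class Action:
    kind: str      # "click" or "type" (hypothetical action types)
    x: int = 0
    y: int = 0
    text: str = ""


def region_changed(before: Frame, after: Frame, x: int, y: int,
                   radius: int = 2, tolerance: int = 0) -> bool:
    """Return True if any pixel within `radius` of (x, y) differs by more than `tolerance`."""
    height = len(before)
    width = len(before[0]) if height else 0
    # A different frame size means the plan is stale.
    if len(after) != height or (height and len(after[0]) != width):
        return True
    for row in range(max(0, y - radius), min(height, y + radius + 1)):
        for col in range(max(0, x - radius), min(width, x + radius + 1)):
            if abs(before[row][col] - after[row][col]) > tolerance:
                return True
    return False


def run_batch(actions: Sequence[Action], planned_on: Frame,
              grab_frame: Callable[[], Frame],
              perform: Callable[[Action], None]) -> int:
    """Run actions planned against `planned_on`.

    Stops before the first click whose target region no longer matches
    the planning screenshot. Returns the number of actions executed,
    so the caller knows where to resume after a new screenshot.
    """
    done = 0
    for action in actions:
        if action.kind == "click":
            current = grab_frame()
            if region_changed(planned_on, current, action.x, action.y):
                break  # target moved or changed: ask the model again
        perform(action)
        done += 1
    return done
```

With this, the tweet example would run the typing step, notice that the area around the send button changed, and stop there instead of clicking empty space. It only detects the change, it doesn't figure out where the button went, so the model still needs a new screenshot for that.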