| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by gaflo 2 days ago
	From how Simon described it it's not a native feature, but one that the model built as a solution for automatically testing. You could already instruct the agent to write a program that saves screenshots to disk and then reads it. As long as the model is multimodal (which pretty much all releases are these days) it can "natively" interpret images. There's probably a clever way to engineer this to be somewhat efficient, but for me it was rather token hungry, because the testing inputs and the description are usually quite verbose. I suppose you could use a weaker model for navigating the test and then only feed the output to the stronger model.