HN Mirror


	by antx 351 days ago
	Also, with the rapid advances in vision language models, I would be surprised if we don't see an image-to-text-to-voice system that works with real-time video in the not-so-distant future! Like a reverse "Genie": instead of you providing a prompt and it generating a world, you provide a streaming video and it spouts relevant information when changes happen, or on demand, for instance...

gostsamo 351 days ago

It would be great to have it as a backup, but it will always be the most computationally expensive and least responsive solution, so it should be the last one used.

fho 351 days ago

Have you played around with the current vision features? I am pretty sure even gpt-4.1 can give you pretty good descriptions of, e.g., screen captures, including being able to "read" and reproduce text.

gostsamo 350 days ago

Yes, there are multiple add-ons giving screen readers the ability to prompt AIs for image recognition. They work rather well, btw, though the value is often situational. Agentic behavior might help further, though it will need some polishing.
