| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by stevofolife 956 days ago
	Am I correct to say this is a multi modal model using vision and audio? What model is it? And how is it understanding the image and the question? Can anyone shed some light on this technical process?

1 comments

obscur 956 days ago

GPT4 is multimodal in the sense that it can take images as input. The person is using a speech to text system such as OpenAIs Whisper and serving screenshots and voice transcripts to GPT4 and GPT4 is returning a text response which is converted to speech using a text to speech system such as OpenAIs TTS API.

link

stevofolife 956 days ago

Ah got it! So basically the prompt to GPT4 is an image + text (converted from audio).

link