Hacker News new | ask | show | jobs
by stevofolife 956 days ago
Am I correct to say this is a multi modal model using vision and audio?

What model is it? And how is it understanding the image and the question? Can anyone shed some light on this technical process?

1 comments

GPT4 is multimodal in the sense that it can take images as input. The person is using a speech to text system such as OpenAIs Whisper and serving screenshots and voice transcripts to GPT4 and GPT4 is returning a text response which is converted to speech using a text to speech system such as OpenAIs TTS API.
Ah got it! So basically the prompt to GPT4 is an image + text (converted from audio).