Hacker News new | ask | show | jobs
by obscur 949 days ago
GPT4 is multimodal in the sense that it can take images as input. The person is using a speech to text system such as OpenAIs Whisper and serving screenshots and voice transcripts to GPT4 and GPT4 is returning a text response which is converted to speech using a text to speech system such as OpenAIs TTS API.
1 comments

Ah got it! So basically the prompt to GPT4 is an image + text (converted from audio).