| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by obscur 995 days ago
	GPT4 is multimodal in the sense that it can take images as input. The person is using a speech to text system such as OpenAIs Whisper and serving screenshots and voice transcripts to GPT4 and GPT4 is returning a text response which is converted to speech using a text to speech system such as OpenAIs TTS API.

1 comments

Ah got it! So basically the prompt to GPT4 is an image + text (converted from audio).