by gangster_dave 936 days ago
The latency depends on those three APIs, but the biggest bottleneck is the GPT-4 API. Its latency varies throughout the day, from <200ms to >1s. There are several application-level optimizations in Vibrato, like managing streaming audio and streaming text, but these aren't as impactful as the API latencies.
1 comments

I wonder if there's some way to stream the GPT-4 response into the text-to-speech API and then stream that voice to the user. I don't think OpenAI's TTS API allows this, but if there were some API (or a self-hosted model) that could do this, you could give the appearance of being faster.
I don't think you can do that quite yet, since the TTS APIs require a full phrase in order to output fluent-sounding speech. If the input is short, then the delivery/emotion/pauses are random per word/token. I actually think that type of system will be possible once we have a multimodal model that understands and outputs speech, with the intelligence of GPT-4.
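
One middle ground between the two comments above is to buffer the streamed text and hand the TTS step one complete sentence at a time, so synthesis can start before the full response has arrived. Here is a minimal sketch of only the buffering part, not how Vibrato does it. The function name `sentences_from_stream` is hypothetical, the input is assumed to be any iterable of text chunks, and the sentence-boundary rule (`.`, `!`, or `?` followed by whitespace) is a simplification that will mis-split abbreviations like "e.g. this".

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: whitespace that follows '.', '!' or '?'.
# Waiting for the whitespace avoids splitting inside "3.5" or on a
# period that arrives at the very end of a chunk.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a chunked text stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_BREAK.split(buffer)
        # Every part except the last is followed by a boundary, so it is complete.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing; keep it for the next chunk.
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    # Simulated token stream; a real one would come from a streaming LLM API.
    fake_stream = ["Hel", "lo the", "re. How ", "are you", "? Fine"]
    for sentence in sentences_from_stream(fake_stream):
        # In a real pipeline, each sentence would be sent to a TTS call here.
        print(sentence)
```

Each yielded sentence would then go to whatever TTS call is in use. This only hides latency after the first sentence, and prosody across sentence boundaries may still sound less natural than synthesizing the whole reply at once.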