by cynicalpeace 936 days ago
Are you doing any extra work to get the low latency? Or is it just that those three API calls (speech-to-text, text-to-response, response-to-voice) have gotten much faster on the third-party side?

The latency depends on those three APIs, but the biggest bottleneck is the GPT-4 API. Its latency varies throughout the day, from <200ms to >1s. There are several application-level optimizations in Vibrato, like managing streaming audio and streaming text, but these aren't as impactful as the API latencies.
I wonder if there's some way to stream the GPT-4 response into the text-to-speech API and then stream that voice to the user. I don't think OpenAI's TTS API allows this, but if there were some API (or a self-hosted model) that could do this, you could give the appearance of being faster.
I don't think you can do that quite yet, since the TTS APIs require a full phrase in order to output fluent-sounding speech. If the input is short, then the delivery/emotion/pauses are random per word/token. I actually think that type of system will be possible once we have a multimodal model that understands and outputs speech, with the intelligence of GPT-4.
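One possible middle ground between the two comments above, sketched in Python: buffer the streamed text and only hand it to TTS once a full phrase has accumulated, so the TTS side never sees a lone word or token. This is only a sketch under assumptions, not how Vibrato works. `phrases_from_tokens` and `min_chars` are hypothetical names, the sentence-boundary rule is a simple heuristic, and no real LLM or TTS API is called.

```python
import re
from typing import Iterable, Iterator, Optional

# Heuristic (an assumption): whitespace that follows ., ! or ? is a safe cut point.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def _find_cut(buffer: str, min_chars: int) -> Optional[re.Match]:
    """Return the first sentence boundary that leaves at least min_chars of text."""
    for match in _BOUNDARY.finditer(buffer):
        if match.start() >= min_chars:
            return match
    return None


def phrases_from_tokens(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Group streamed text chunks into full phrases (hypothetical helper).

    A phrase is yielded as soon as a sentence boundary appears with at least
    min_chars of text before it; anything left over is flushed at the end.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        cut = _find_cut(buffer, min_chars)
        while cut is not None:
            yield buffer[:cut.start()]
            buffer = buffer[cut.end():]
            cut = _find_cut(buffer, min_chars)
    remainder = buffer.strip()
    if remainder:
        yield remainder


if __name__ == "__main__":
    # Stand-in for a streamed model response; a real stream would arrive over time.
    fake_stream = ["Hello", " there", ".", " How", " are", " you", " today", "?", " Fine", "."]
    for phrase in phrases_from_tokens(fake_stream):
        print(phrase)  # each phrase would go to a TTS request here
```

The first phrase can start playing while the model is still generating the rest, which hides part of the latency. The trade-off is the one raised above: a smaller `min_chars` starts audio sooner but gives the TTS less context for delivery and pauses.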