Are you doing any extra work for the low latency? Or is it just that those 3 API callouts (speech-to-text, text-to-response, response-to-voice) have gotten much faster on the third-party side?
The latency depends on those three APIs, but the biggest bottleneck is the GPT-4 API. Its latency varies throughout the day, from <200ms to >1s. There are several application-level optimizations in Vibrato, like managing streaming audio and streaming text, but these aren't as impactful as the API latencies.
I wonder if there's some way to stream the GPT-4 response into the text-to-speech API and then stream that voice to the user. I don't think OpenAI's TTS API allows this, but if there were some API that could do this (or a self-hosted model), you could give the appearance of being faster.
I don't think you can do that quite yet, since the TTS APIs require a full phrase in order to output fluent-sounding speech. If the input is short, then the delivery/emotion/pauses are random per word/token. I actually think that type of system will be possible once we have a multimodal model that understands and outputs speech, with the intelligence of GPT-4.
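To make the full-phrase constraint concrete, here is a rough sketch of one possible middle ground: buffer the streamed text until a sentence boundary, then hand each complete sentence to TTS. This is not how Vibrato works, as far as this thread says. The names `phrases`, `speak_stream`, and the `synthesize` callback are hypothetical stand-ins rather than any real API, and the punctuation rule is a naive assumption that will mis-split abbreviations like "Dr. Smith".

```python
import re
from typing import Callable, Iterable, Iterator

# Assumption: sentence-ending punctuation followed by whitespace marks a
# phrase boundary. Naive on purpose; abbreviations will be split too.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def phrases(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last ended at a boundary, so it is complete.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing; keep it for the next token.
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


def speak_stream(tokens: Iterable[str], synthesize: Callable[[str], None]) -> None:
    """Call the (hypothetical) TTS callback once per complete sentence."""
    for sentence in phrases(tokens):
        synthesize(sentence)


if __name__ == "__main__":
    # Simulated token stream standing in for a streaming LLM response.
    fake_stream = ["Hel", "lo there", ". How", " are you", "? Fine"]
    speak_stream(fake_stream, synthesize=print)
```

With this approach the first sentence can start playing while later ones are still being generated, so perceived latency is roughly the time to the first sentence rather than to the full response. The delivery within each sentence still depends on the TTS model seeing the whole phrase, which is the limitation described above.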