Thank you! The TTS and ASR are our own unreleased models, but we'll open-source them soon :)
The latency is about 500ms once we detect that it's the bot's turn to speak (roughly 200ms for the LLM's time to first token and 300ms for the TTS audio to start), plus a variable wait from the semantic pause detection (VAD).
If it's clear that you're done talking, like when you ask a question, the model will reply very fast. If you stop mid-sentence as if you have more to say, it will wait longer to avoid interrupting you.
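To make the arithmetic concrete, here's a rough sketch of that latency budget. The 200ms and 300ms figures are the ones quoted above; everything else is an assumption for illustration only (the `wait_before_reply_ms` policy, its linear shape, and the 2000ms cap are hypothetical, not how our pause detection actually works).

```python
# Fixed costs quoted above, in milliseconds.
LLM_TIME_TO_FIRST_TOKEN_MS = 200
TTS_TIME_TO_FIRST_AUDIO_MS = 300


def wait_before_reply_ms(
    p_turn_done: float,
    min_wait_ms: float = 0.0,
    max_wait_ms: float = 2000.0,  # hypothetical cap, not a measured value
) -> float:
    """Hypothetical policy: the more confident we are that the user has
    finished their turn, the shorter we wait before replying."""
    if not 0.0 <= p_turn_done <= 1.0:
        raise ValueError("p_turn_done must be a probability in [0, 1]")
    # Linear interpolation: p=1 gives min_wait_ms, p=0 gives max_wait_ms.
    return max_wait_ms - p_turn_done * (max_wait_ms - min_wait_ms)


def response_latency_ms(pause_wait_ms: float) -> float:
    """Delay from the end of the user's speech to the bot's first audio:
    the variable pause-detection wait plus the fixed LLM and TTS costs."""
    return pause_wait_ms + LLM_TIME_TO_FIRST_TOKEN_MS + TTS_TIME_TO_FIRST_AUDIO_MS
```

With a clear end of turn (say a question, `p_turn_done` near 1) the total stays close to the 500ms floor; with a mid-sentence stop (`p_turn_done` near 0) the wait term dominates.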