by ton4eg 395 days ago
Way more entertaining than I would expect! What TTS and ASR models do you use? What sort of latency do you get?

Thank you! The TTS and ASR are our own unreleased models, but we'll open-source them soon :)

The latency is about 500ms once we detect that it's the bot's turn to speak (roughly 200ms for the LLM's time to first token and 300ms for the TTS audio to start), plus a variable wait for the semantic pause detection (VAD).

If it's clear that you're done talking, like when you ask a question, the model will reply very fast. If you stop mid-sentence as if you have more to say, it will wait longer to avoid interrupting you.
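
To make the arithmetic concrete, here is a rough sketch of how the total delay could add up. Only the 200ms and 300ms figures come from the numbers above. The `p_done` score, the linear rule in `pause_wait_ms`, and the 100ms/2000ms bounds are hypothetical stand-ins for illustration, not how the actual pause detection works.

```python
# Figures quoted in the reply above.
LLM_TTFT_MS = 200.0   # LLM time to first token
TTS_START_MS = 300.0  # time for TTS audio to start


def pause_wait_ms(p_done: float,
                  min_wait_ms: float = 100.0,
                  max_wait_ms: float = 2000.0) -> float:
    """Hypothetical rule: wait less the more confident we are the user is done.

    p_done is an assumed score in [0, 1] from some end-of-turn detector.
    The bounds and the linear interpolation are illustrative guesses.
    """
    if not 0.0 <= p_done <= 1.0:
        raise ValueError("p_done must be between 0 and 1")
    return max_wait_ms - p_done * (max_wait_ms - min_wait_ms)


def total_latency_ms(p_done: float) -> float:
    """Variable pause wait plus the fixed ~500ms LLM + TTS pipeline."""
    return pause_wait_ms(p_done) + LLM_TTFT_MS + TTS_START_MS


if __name__ == "__main__":
    # A clear question (high p_done) versus a mid-sentence stop (low p_done).
    for p in (1.0, 0.5, 0.0):
        print(f"p_done={p}: {total_latency_ms(p)} ms")
```

With these made-up bounds, a clearly finished question gets a reply after about 600ms, while a mid-sentence stop waits the full 2000ms before the 500ms pipeline even starts.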