Thank you! The TTS and ASR are our own unreleased models, but we'll open-source them soon :)

The latency is about 500ms once we detect that it's the bot's turn to speak (roughly 200ms for the LLM's time to first token and 300ms for the TTS audio to start), plus a variable wait while the semantic pause detection (VAD) decides whether you've finished talking.

If it's clear that you're done talking, like when you ask a question, the model will reply very fast. If you stop mid-sentence as if you have more to say, it will wait longer to avoid interrupting you.
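
To make the numbers concrete, here's a rough sketch of how that latency adds up. The 200ms and 300ms figures are the ones quoted above; everything else is an assumption for illustration. In particular, `pause_wait_ms` and its 200ms/1200ms wait values are hypothetical, and the punctuation check is just a stand-in for the real semantic pause detection, which is a model and not a rule like this.

```python
LLM_TTFT_MS = 200   # quoted above: LLM time to first token
TTS_START_MS = 300  # quoted above: time for TTS audio to start


def pause_wait_ms(transcript: str) -> int:
    """Hypothetical stand-in for semantic pause detection.

    Returns how long to wait in silence before deciding the user is done.
    The wait values are made up for illustration.
    """
    text = transcript.strip()
    if text.endswith(("?", ".", "!")):
        return 200   # assumed: utterance looks complete, reply quickly
    return 1200      # assumed: looks mid-sentence, wait longer


def total_latency_ms(transcript: str) -> int:
    """Silence wait plus the fixed ~500ms LLM + TTS pipeline cost."""
    return pause_wait_ms(transcript) + LLM_TTFT_MS + TTS_START_MS


print(total_latency_ms("What time is it?"))
print(total_latency_ms("I was thinking that maybe we could"))
```

The fixed part is always 200 + 300 = 500ms; only the wait before it changes with how finished the utterance sounds.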