by kai2006
213 days ago
Looks really cool. It feels like a response takes about 3 seconds from when the UI switches from "listening" to "thinking" until I hear it played on my headphones (Bluetooth, so maybe that adds latency). Something feels a bit uncanny when I haven't said anything yet and the AI persona looks dead straight into the camera, smiling at me. What tech stack are you using under the hood?

Yeah, that latency makes sense; "listening" includes turn detection and STT, and "thinking" is LLM + TTS _and then_ our model, so the pipeline latency stacks up pretty quickly. The actual video model starts streaming out frames <500ms from the TTS generation, but we're still working on reducing latency in the parts of the pipeline that we use off the shelf.
We have a high-level blog post here https://www.keyframelabs.com/blog/persona-1 about the architecture of the video model; the WebRTC "agent" stack is LiveKit + a few backend components hosted on Modal.
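
To make the "latency stacks up" point concrete, here is a minimal sketch of a serial latency budget. The stage order follows the reply above; the stage names, the `Stage` class, and every number are hypothetical placeholders rather than measurements, with the single exception that the video stage is capped at the "<500ms" figure quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    phase: str         # UI state the stage runs under: "listening" or "thinking"
    latency_ms: float  # hypothetical placeholder, not a measured value


# Order follows the reply: turn detection + STT, then LLM + TTS, then video.
# All numbers are illustrative guesses; only the 500 ms upper bound for the
# video stage comes from the thread.
PIPELINE = [
    Stage("turn_detection", "listening", 300),
    Stage("stt", "listening", 200),
    Stage("llm_first_token", "thinking", 700),
    Stage("tts_first_audio", "thinking", 400),
    Stage("video_first_frame", "thinking", 500),
]


def phase_totals(stages):
    """Sum latency per UI phase, assuming stages run strictly one after another."""
    totals = {}
    for stage in stages:
        totals[stage.phase] = totals.get(stage.phase, 0) + stage.latency_ms
    return totals


def end_to_end_ms(stages):
    """Total time from end of user speech to first video frame."""
    return sum(stage.latency_ms for stage in stages)


if __name__ == "__main__":
    for phase, ms in phase_totals(PIPELINE).items():
        print(f"{phase}: {ms:.0f} ms")
    print(f"end to end: {end_to_end_ms(PIPELINE):.0f} ms")
```

Because the stages are modeled as strictly serial, shaving time off any one of them reduces the total by the same amount; a real pipeline that streams partial output between stages would overlap them and come in below this simple sum.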