|
|
|
|
|
by joshstrange
410 days ago
|
|
> where it processes the incoming speech in real time and responds when it's confident it has heard enough to understand the meaning. I'm not an expert on LLMs but that feels completely counter to how LLMs work (again, _not_ an expert). I don't know how we can "stream" the input and have the generation update/change in real time, at least not in 1 model. Then again, what is a "model"? Maybe your model fires off multiple generations internally and starts generating after every word, or at least starts asking sub-LLM models "Do I have enough to reply?" and once it does it generates a reply and interrupts. I'm not sure how most apps handle the user interrupting, in regards to the conversation context. Do they stop generation but use what they have generated already in the context? Do they cut off where the LLM got interrupted? Something like "LLM: ..and then the horse walked... -USER INTERRUPTED-. User: ....". It's not a purely-voice-LLM issue but it comes up way more for that since rarely are you stopping generation (in the demo, that's been done for a while when he interrupts), just the TTS. |
|
The only model that has attempted this (as far as I know) is Moshi from Kyutai. It solves it by having a fully-duplex architecture. The model is processing the audio from the user while generating output audio. Both can be active at the same time, talking over each other, like real conversations. It's still in research phase and the model isn't very smart yet, both in what it says and when it decides to speak. It just needs more data and more training.
https://moshi.chat/