by cubefox
453 days ago
That's nice, but the main problem with current voice turn-taking is different: these systems don't know when it is their turn to speak. When one human speaks to another, the listener interprets what is being said and guesses when the speaker has finished. Voice agents don't work that way at all. The speech-to-text front end just seems to have a hardcoded pause detector, e.g. 2 seconds, and as soon as 2 seconds of silence are detected, the "end of message" token is sent and the LLM starts talking, even if you were just collecting your thoughts and weren't finished at all. So the semantic content of what you are saying is completely ignored for turn-taking, and no analysis takes place to determine whether the user has likely said everything they wanted to say. Instead of the rigid pause detector, it would actually make more sense to send the end-of-message token when you explicitly say a specific phrase, like literally "over", which was of course common in half-duplex radio, where only one person could transmit at a time. LLMs are half-duplex too: they can't listen and talk at the same time.
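To make the complaint concrete, here is a minimal sketch of the kind of fixed-silence endpointing described above. It is an assumption about how such a detector might look, not the code of any particular product: the function name, the 20 ms frame size and the per-frame speech/silence flags are all hypothetical.

```python
from typing import Iterable, Optional


def fixed_pause_endpoint(
    frame_is_speech: Iterable[bool],
    frame_ms: int = 20,      # assumed frame length, hypothetical
    pause_ms: int = 2000,    # the hardcoded "2 seconds of silence" rule
) -> Optional[int]:
    """Return the frame index at which the turn is declared over, or None.

    Only the duration of silence is considered. Nothing about what was
    actually said enters the decision.
    """
    silent_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_ms = 0            # any speech resets the pause timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= pause_ms:
                return i             # end-of-message would be sent here
    return None
```

A speaker who pauses for 2 seconds mid-sentence gets cut off at exactly the same point as one who has finished, which is the failure mode the comment describes.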
That doesn’t sound very conversational at all. Instead one could train the network to recognise the appropriate turn-taking points.
The simple way to do that is to have the model output a “listen a bit more” token when it is not yet its turn to talk. You can use recordings of real-life conversations to build the initial training set, and then add more data from cases where clashes happen (where the AI and the speaker talk over each other).
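A rough sketch of what that loop could look like at inference time. The token names, `wait_for_turn` and the toy predictor are all made up for illustration; the predictor is a trivial punctuation rule standing in for a trained model, and it works on transcribed text chunks rather than audio.

```python
from typing import Callable, Iterable, List, Optional

LISTEN = "<listen>"  # hypothetical "listen a bit more" token
SPEAK = "<speak>"    # hypothetical "my turn now" token


def wait_for_turn(
    chunks: Iterable[str],
    predict_token: Callable[[List[str]], str],
) -> Optional[int]:
    """Feed chunks to the model one by one; return the index of the
    chunk after which the model decides it is its turn, or None."""
    heard: List[str] = []
    for i, chunk in enumerate(chunks):
        heard.append(chunk)
        if predict_token(heard) == SPEAK:
            return i
        # Otherwise the model emitted LISTEN and we keep listening.
    return None


def toy_predictor(heard: List[str]) -> str:
    """Stand-in for a trained model: treats a trailing '?' or '.' as a
    finished thought, but a trailing '...' as still thinking."""
    text = " ".join(heard).strip()
    if text.endswith("..."):
        return LISTEN
    return SPEAK if text.endswith(("?", ".")) else LISTEN
```

With the chunks `["So I was thinking...", "maybe we could", "meet on Friday?"]`, the toy predictor keeps listening through the trailing-off first chunk and only takes the turn after the third.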
More complicated would be a system where the model is periodically fed the audio so far and predicts what the speaker is likely to say next and, based on that, when it is appropriate to respond and with what. A smaller, faster, local model can then be used to verify whether what was actually said matches the prediction and, if so, output the already generated response. If there is a mismatch, it engages the more expensive model to come up with a new prediction.
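Sketched very roughly, the predict-then-verify idea might look like the following. Everything here is hypothetical: `Speculation`, `run_turn`, the word-matching `verify` (standing in for the small local model) and the lookup-table `toy_expensive_predict` (standing in for the large model) are illustrations only, and text chunks are used in place of audio.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Speculation:
    predicted_rest: str  # what the big model expects the speaker to say next
    response: str        # reply prepared in advance for that continuation


def verify(predicted: str, actual: str) -> str:
    """Toy stand-in for the small local model: compare word sequences."""
    p, a = predicted.lower().split(), actual.lower().split()
    if a == p:
        return "complete"
    return "partial" if p[: len(a)] == a else "mismatch"


def run_turn(
    chunks: Iterable[str],
    expensive_predict: Callable[[List[str]], Speculation],
) -> Tuple[Optional[str], int]:
    """Return (response, number of expensive-model calls)."""
    heard: List[str] = []
    since: List[str] = []  # chunks heard after the current speculation
    spec: Optional[Speculation] = None
    calls = 0
    for chunk in chunks:
        heard.append(chunk)
        if spec is not None:
            since.append(chunk)
            status = verify(spec.predicted_rest, " ".join(since))
            if status == "complete":
                return spec.response, calls  # prediction held, reply now
            if status == "partial":
                continue                     # on track, keep listening
        # No speculation yet, or a mismatch: ask the expensive model again.
        spec = expensive_predict(heard)
        calls += 1
        since = []
        if not spec.predicted_rest.strip():
            return spec.response, calls      # model thinks the turn is over
    return None, calls


def toy_expensive_predict(heard: List[str]) -> Speculation:
    """Lookup table standing in for the large model."""
    table = {
        "what time": Speculation("is it", "It is noon."),
        "what time does the": Speculation("store close", "It closes at nine."),
    }
    return table.get(" ".join(heard), Speculation("", "Sorry, could you repeat that?"))
```

In this toy run the first guess ("is it") is wrong, so the expensive model is called a second time; the second guess matches and its prepared response is emitted without any further large-model call.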
If you engineer this right, you can reuse the state vector from save points instead of reprocessing the whole conversation from the start, and save a bit of compute that way.
Asking the user to say “over” at the end of their turn is the most heavy-handed solution. Recognising the flow of a conversation is just pattern recognition, and that is what machine learning is good at.