| Oh, this is really interesting to me. This is what I worked on at Amazon Alexa (and have patents on). An interesting fact I learned at the time: The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done. You've probably experienced this, and you talk about how you "finish each other's sentences". It's because your brain is predicting what they will say while they speak, and processing an answer at the same time. It's also why when they say what you didn't expect, you say, "what?" and then answer half a second later, when your brain corrects. Fact 2: Humans expect a delay on their voice assistants, for two reasons. One reason is because they know it's a computer that has to think. And secondly, cell phones. Cell phones have a built in delay that breaks human to human speech, and your brain thinks of a voice assistant like a cell phone. Fact 3: Almost no response from Alexa is under 500ms. Even the ones that are served locally, like "what time is it". Semantic end-of-turn is the key here. It's something we were working on years ago, but didn't have the compute power to do it. So at least back then, end-of-turn was just 300ms of silence. This is pretty awesome. It's been a few years since I worked on Alexa (and everything I wrote has been talked about publicly). But I do wonder if they've made progress on semantic detection of end-of-turn. Edit: Oh yeah, you are totally right about geography too. That was a huge unlock for Alexa. Getting the processing closer to the user. |